Method and System for Non-Parametric Voice Conversion

ABSTRACT

A method and system are disclosed for non-parametric speech conversion. A text-to-speech (TTS) synthesis system may include hidden Markov model (HMM) based speech modeling for synthesizing output speech. A converted HMM may be initially set to a source HMM trained with a voice of a source speaker. A parametric representation of speech may be extracted from speech of a target speaker to generate a set of target-speaker vectors. A matching procedure, carried out under a transform that compensates for speaker differences, may be used to match each HMM state of the source HMM to a target-speaker vector. The HMM states of the converted HMM may be replaced with the matched target-speaker vectors. Transforms may be applied to further adapt the converted HMM to the voice of the target speaker. The converted HMM may be used to synthesize speech with voice characteristics of the target speaker.

BACKGROUND

Unless otherwise indicated herein, the materials described in this section are not prior art to the claims in this application and are not admitted to be prior art by inclusion in this section.

A goal of automatic speech recognition (ASR) technology is to map a particular utterance to an accurate textual representation, or other symbolic representation, of that utterance. For instance, ASR performed on the utterance “my dog has fleas” would ideally be mapped to the text string “my dog has fleas,” rather than the nonsensical text string “my dog has freeze,” or the reasonably sensible but inaccurate text string “my bog has trees.”

A goal of speech synthesis technology is to convert written language into speech that can be output in an audio format, for example directly or stored as an audio file suitable for audio output. The written language could take the form of text, or symbolic linguistic representations. The speech may be generated as a waveform by a speech synthesizer, which produces artificial human speech. Natural-sounding human speech may also be a goal of a speech synthesis system.

Various technologies, including computers, network servers, telephones, and personal digital assistants (PDAs), can be employed to implement an ASR system and/or a speech synthesis system, or one or more components of such systems. Communication networks may in turn provide communication paths and links between some or all of such devices, supporting speech synthesis system capabilities and services that may utilize ASR and/or speech synthesis system capabilities.

BRIEF SUMMARY

In one aspect, an example embodiment presented herein provides a method comprising: training a source hidden Markov model (HMM) based speech features generator implemented by one or more processors of a system using speech signals of a source speaker, wherein the source HMM based speech features generator comprises a configuration of source HMM state models, each of the source HMM state models having a set of generator-model functions; extracting speech features from speech signals of a target speaker to generate a target set of target-speaker vectors; for each given source HMM state model of the configuration, determining a particular target-speaker vector from among the target set that most closely matches parameters of the set of generator-model functions of the given source HMM state model; determining a fundamental frequency (F0) transform that speech-adapts F0 statistics of the source HMM based speech features generator to match F0 statistics of the speech of the target speaker; constructing a converted HMM based speech features generator implemented by one or more processors of the system to be the same as the source HMM based speech features generator, but wherein the parameters of the set of generator-model functions of each source HMM state model of the converted HMM based speech features generator are replaced with the determined particular most closely matching target-speaker vector from among the target set; and speech-adapting F0 statistics of the converted HMM based speech features generator using the F0 transform to thereby produce a speech-adapted converted HMM based speech features generator.

In another aspect, an example embodiment presented herein provides a method comprising: implementing a source hidden Markov model (HMM) based speech features generator by one or more processors of a system, wherein the source HMM based speech features generator comprises a configuration of source HMM state models, each of the source HMM state models having a set of generator-model functions, and wherein the implemented source HMM based speech features generator is trained using speech signals of a source speaker; providing a set of target-speaker vectors, the set of target-speaker vectors having been generated from speech features extracted from speech signals of a target speaker; implementing a converted HMM based speech features generator that is the same as the source HMM based speech features generator, but wherein (i) parameters of the set of generator-model functions of each given source HMM state model of the converted HMM based speech features generator are replaced with a particular target-speaker vector from among the target set that most closely matches the parameters of the set of generator-model functions of the given source HMM state model, and (ii) fundamental frequency (F0) statistics of the converted HMM based speech features generator are speech-adapted using an F0 transform that speech-adapts F0 statistics of the source HMM based speech features generator to match F0 statistics of the speech of the target speaker; receiving an enriched transcription of a run-time text string by an input device of the system; using the converted HMM based speech features generator to convert the enriched transcription into corresponding output speech features; and generating a synthesized utterance of the enriched transcription using the output speech features, the synthesized utterance having voice characteristics of the target speaker.

In still another aspect, an example embodiment presented herein provides a system comprising: one or more processors; memory; and machine-readable instructions stored in the memory, that upon execution by the one or more processors cause the system to carry out functions including: implementing a source hidden Markov model (HMM) based speech features generator by one or more processors of a system, wherein the source HMM based speech features generator comprises a configuration of source HMM state models, each of the source HMM state models having a set of generator-model functions, and wherein the implemented source HMM based speech features generator is trained using speech signals of a source speaker; providing a set of target-speaker vectors, the set of target-speaker vectors having been generated from speech features extracted from speech signals of a target speaker; implementing a converted HMM based speech features generator that is the same as the source HMM based speech features generator, but wherein (i) parameters of the set of generator-model functions of each given source HMM state model of the converted HMM based speech features generator are replaced with a particular target-speaker vector from among the target set that most closely matches the parameters of the set of generator-model functions of the given source HMM state model, and (ii) fundamental frequency (F0) statistics of the converted HMM based speech features generator are speech-adapted using an F0 transform that speech-adapts F0 statistics of the source HMM based speech features generator to match F0 statistics of the speech of the target speaker.

In yet another aspect, an example embodiment presented herein provides an article of manufacture including a computer-readable storage medium having stored thereon program instructions that, upon execution by one or more processors of a system, cause the system to perform operations comprising: implementing a source hidden Markov model (HMM) based speech features generator by one or more processors of a system, wherein the source HMM based speech features generator comprises a configuration of source HMM state models, each of the source HMM state models having a set of generator-model functions, and wherein the implemented source HMM based speech features generator is trained using speech signals of a source speaker; providing a set of target-speaker vectors, the set of target-speaker vectors having been generated from speech features extracted from speech signals of a target speaker; implementing a converted HMM based speech features generator that is the same as the source HMM based speech features generator, but wherein (i) parameters of the set of generator-model functions of each given source HMM state model of the converted HMM based speech features generator are replaced with a particular target-speaker vector from among the target set that most closely matches the parameters of the set of generator-model functions of the given source HMM state model, and (ii) fundamental frequency (F0) statistics of the converted HMM based speech features generator are speech-adapted using an F0 transform that speech-adapts F0 statistics of the source HMM based speech features generator to match F0 statistics of the speech of the target speaker.

In yet a further aspect, an example embodiment presented herein provides an article of manufacture including a computer-readable storage medium, having stored thereon program instructions that, upon execution by one or more processors of a system, cause the system to perform operations comprising: training a source hidden Markov model (HMM) based speech features generator using speech signals of a source speaker, wherein the source HMM based speech features generator comprises a configuration of source HMM state models, each of the source HMM state models having a set of generator-model functions; extracting speech features from speech signals of a target speaker to generate a target set of target-speaker vectors; for each given source HMM state model of the configuration, determining a particular target-speaker vector from among the target set that most closely matches parameters of the set of generator-model functions of the given source HMM state model; determining a fundamental frequency (F0) transform that speech-adapts F0 statistics of the source HMM based speech features generator to match F0 statistics of the speech of the target speaker; constructing a converted HMM based speech features generator to be the same as the source HMM based speech features generator, but wherein the parameters of the set of generator-model functions of each source HMM state model of the converted HMM based speech features generator are replaced with the determined particular most closely matching target-speaker vector from among the target set; and speech-adapting F0 statistics of the converted HMM based speech features generator using the F0 transform to thereby produce a speech-adapted converted HMM based speech features generator.

These as well as other aspects, advantages, and alternatives will become apparent to those of ordinary skill in the art by reading the following detailed description, with reference where appropriate to the accompanying drawings. Further, it should be understood that this summary and other descriptions and figures provided herein are intended to illustrate embodiments by way of example only and, as such, that numerous variations are possible. For instance, structural elements and process steps can be rearranged, combined, distributed, eliminated, or otherwise changed, while remaining within the scope of the embodiments as claimed.

BRIEF DESCRIPTION OF DRAWINGS

FIG. 1 is a flowchart illustrating an example method in accordance with an example embodiment.

FIG. 2 is a flowchart illustrating an example method in accordance with another example embodiment.

FIG. 3 is a block diagram of an example network and computing architecture, in accordance with an example embodiment.

FIG. 4A is a block diagram of a server device, in accordance with an example embodiment.

FIG. 4B depicts a cloud-based server system, in accordance with an example embodiment.

FIG. 5 depicts a block diagram of a client device, in accordance with an example embodiment.

FIG. 6 depicts a simplified block diagram of an example text-to-speech system, in accordance with an example embodiment.

FIG. 7 is a block diagram depicting additional details of an example text-to-speech system, in accordance with an example embodiment.

FIG. 8 is a schematic illustration of configuring a hidden Markov model for non-parametric voice conversion, in accordance with an example embodiment.

FIG. 9 depicts a block diagram of an example text-to-speech system configured for non-parametric voice conversion, in accordance with an example embodiment.

FIG. 10 is a conceptual illustration of parametric and non-parametric mapping between vector spaces, in accordance with an example embodiment.

DETAILED DESCRIPTION

1. Overview

A speech synthesis system can be a processor-based system configured to convert written language into artificially produced speech or spoken language. The written language could be written text, such as one or more written sentences or text strings, for example. The written language could also take the form of other symbolic representations, such as a speech synthesis mark-up language, which may include information indicative of speaker emotion, speaker gender, speaker identification, as well as speaking styles. The source of the written text could be input from a keyboard or keypad of a computing device, such as a portable computing device (e.g., a PDA, smartphone, etc.), or could be from a file stored on one or another form of computer readable storage medium. The artificially produced speech could be generated as a waveform from a signal generation device or module (e.g., a speech synthesizer device), and output by an audio playout device and/or formatted and recorded as an audio file on a tangible recording medium. Such a system may also be referred to as a “text-to-speech” (TTS) system, although the written form may not necessarily be limited to only text.

A speech synthesis system may operate by receiving an input text string (or other form of written language), and translating the written text into an enriched transcription corresponding to a symbolic representation of how the spoken rendering of the text sounds or should sound. The enriched transcription may then be mapped to speech features that parameterize an acoustic rendering of the enriched transcription, and which then serve as input data to a signal generation module, device, or element that can produce an audio waveform suitable for playout by an audio output device. The playout may sound like a human voice speaking the words (or sounds) of the input text string, for example. In the context of speech synthesis, the more natural the sound (e.g., to the human ear) of the synthesized voice, generally the better the voice quality. The audio waveform could also be generated as an audio file that may be stored or recorded on storage media suitable for subsequent playout.

In operation, a TTS system may be used to convey information from an apparatus (e.g., a processor-based device or system) to a user, such as messages, prompts, answers to questions, instructions, news, emails, and speech-to-speech translations, among other information. Speech signals may themselves carry various forms or types of information, including linguistic content, effectual state (e.g., emotion and/or mood), physical state (e.g., physical voice characteristics), and speaker identity, to name a few.

Some applications of a TTS system may benefit from an ability to convey speaker identity in the sound of the synthesized voice. For example, TTS delivery of emails or other text-based messages could be synthesized in a voice that sounds like the sender. As another example, text-based conversation (e.g., “chat”) applications, in which two or more participants provide text-based input, could similarly deliver synthesized speech output in the respective voices that sound like those of the participants. In general, the quality of communication may be improved or enhanced when speaker voices are known, familiar, and/or distinguishable. These applications represent examples of a technology typically referred to as “intra-lingual voice conversion.” Similar benefits may accrue for “cross-lingual voice conversion,” such as may be used in a speech-to-speech (S2S) system that translates input speech in one language to output speech in another language. Communication may sound more natural when the synthesized output speech has voice characteristics of the input speech.

A TTS system may use a statistical model of a parametric representation of speech to synthesize speech. In particular, statistical modeling may be based on hidden Markov models (HMMs). One advantageous aspect of HMM-based speech synthesis is that it can facilitate altering or adjusting characteristics of the synthesized voice using one or another form of statistical adaptation. For example, given data in the form of recordings of a target speaker, the HMM can be adapted to the data so as to make the HMM-based synthesizer sound like the target speaker. The ability to adapt HMM-based synthesis can therefore make it a flexible approach.

Methods used for statistical adaptation of HMMs and of Gaussian mixture models (GMMs) may also be applied to voice conversion. However, conventional methods typically employ a parametric approach to voice conversion, in the sense that they attempt to find an optimal transform that adapts the statistics of the HMM to the statistics of a target speaker. The adaptation is made in terms of maximum-likelihood, usually regardless of the model. For example, the adaptation may be made by applying a transformation to parameters of Gaussian states of the HMM. A side-effect of this transformation is that it can over-smooth the spectral envelopes of the HMM, which can in turn result in lower-quality speech having a muffled character. Other defects may be introduced as well.
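To make this side-effect concrete, the following is a minimal Python sketch of the kind of shared linear transform (in the style of MLLR-type adaptation) that might be applied to Gaussian state means; the function and variable names are illustrative and not taken from any particular implementation:

    import numpy as np

    def adapt_gaussian_means(means, A, b):
        """Apply one shared linear transform, mu' = A @ mu + b, to every
        Gaussian state mean. Because a single maximum-likelihood transform
        is shared across many states, it tends to average over the target
        data, which is one way spectral over-smoothing can arise."""
        return np.array([A @ mu + b for mu in means])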

Example embodiments are described herein for a method and system for non-parametric voice conversion for HMM-based speech synthesis that, among other advantages, can overcome limitations and drawbacks of conventional approaches of voice conversion based on statistical adaptation. More specifically, statistical adaptation is predicated on adapting models to describe a target speaker as well as possible in terms of likelihood, while in voice conversion the goal is for models to sound like the target speaker. Target speaker recordings can contain variability that may not necessarily be related to voice characteristics or qualitative aspects. For example, they may contain multiple versions of a phonetic speech unit (e.g., a phoneme) that are unrelated to linguistic/effectual content and instead are due to random variations. In terms of adaptation, all versions of this phonetic speech unit would typically be modeled as variations of a phenomenon that the model needs to capture. The result in terms of voice conversion could tend to be the model of the phonetic speech unit sounding muffled, for example. In terms of voice conversion, however, a more appropriate goal might be to model just one of the multiple versions. Thus, the variability over multiple realizations of the phonetic speech unit could be considered perceptually redundant. This can be interpreted as a mismatch between the underlying goals of statistical adaptation and voice conversion.

In accordance with example embodiments, an HMM-based TTS system may be trained using extensive standard-voice recordings of a “source” speaker. This can amount to application of high-quality, proven training techniques, for example. Referring to the HMM of the TTS system as a “source HMM,” this training process may be said to train the source HMM in the voice of the source speaker. As a result, the source HMM acquires a set of Gaussian statistical generator functions that have been iteratively and cumulatively built based on voice characteristics of the source speaker. The Gaussian generator functions of the source HMM correspond to probability density functions (PDFs) for jointly modeling spectral envelope parameters and excitation parameters of phonetic speech units. The phonetic speech units could be phonemes or triphones, for example.
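As an illustration of how such generator functions might be organized in software, the following Python sketch defines one possible data structure for a source HMM state model with diagonal-covariance Gaussian PDFs over the spectral and excitation streams; all names, and the diagonal-covariance assumption, are illustrative:

    from dataclasses import dataclass
    import numpy as np

    @dataclass
    class HMMStateModel:
        """One state model of the source HMM. Each stream is modeled by a
        diagonal-covariance Gaussian PDF (an illustrative simplification)."""
        spectral_mean: np.ndarray    # e.g., mel-cepstral coefficients
        spectral_var: np.ndarray
        excitation_mean: np.ndarray  # e.g., aperiodicity / log-F0 stream
        excitation_var: np.ndarray

    # A source HMM based speech features generator is then a configuration
    # of such states, e.g., keyed by context-dependent phonetic unit:
    SourceHMM = dict[str, list[HMMStateModel]]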

Also in accordance with example embodiments, speech features may be extracted from speech signals of a “target speaker” in order to generate a set of target-speaker vectors that parameterize the speech signals. For example, the speech signals could be voice recordings of the target speaker.

In further accordance with example embodiments, an analytical matching procedure may be carried out to identify, for parameters of each Gaussian statistical generator function of the source HMM, a closest-match speech vector from among the set of target-speaker vectors. This process is enabled by a novel and effective “matching under transform” technique, and results in a set of Gaussian statistical generator functions fashioned from characteristics of the target speaker's voice that can be applied by the source HMM. The matching under transform (“MUT”) technique entails a matching procedure that can compensate for inter-speaker speech differences (e.g., differences between the source speaker and the target speaker). The matching procedure can be specified in terms of a MUT algorithm suitable for implementation as executable instructions on one or more processors of a system, such as a TTS system. Taken with additional steps described below, the effect can be to construct a speech synthesizer with voice characteristics of the target speaker.

As an additional aspect of matching, a transform that adapts statistics of fundamental frequency (F0) of the source HMM to the F0 statistics of the target speaker is computed. In the context of speech recognition and synthesis, F0 relates to the pitch of the voice.

In a further aspect of example embodiments, a converted HMM is constructed by first creating a copy of the source HMM, such that the converted HMM initially has the Gaussian statistical generator functions of the source HMM. Next, the parameters of the Gaussian statistical generator functions of the converted HMM, which are initially the same as those of the source HMM, are replaced with the target-speaker vectors identified using the matching under transform algorithm. Finally, the F0 transformation is applied to the converted HMM. The converted HMM can now be considered as being configured to generate acoustic features of speech units characterized by the sound of the target voice.

In accordance with example embodiments, the source HMM of the TTS system may be replaced with the converted HMM, prepared as described above. At run-time, the TTS speech synthesizer may then be used to synthesize speech with voice characteristics of the target speaker.

2. Example Methods

In example embodiments, a TTS synthesis system may include one or more processors, one or more forms of memory, one or more input devices/interfaces, one or more output devices/interfaces, and machine-readable instructions that when executed by the one or more processors cause the TTS synthesis system to carry out the various functions and tasks described herein. The TTS synthesis system may also include implementations based on one or more hidden Markov models. In particular, the TTS synthesis system may employ methods that incorporate HMM-based speech synthesis, HMM-based speech recognition, and HMM-based voice conversion, as well as other possible components. Two examples of such a method are described in the current section.

FIG. 1 is a flowchart illustrating an example method in accordance with example embodiments. At step 102, a source HMM based speech features generator, implemented by one or more processors of the system, is trained using speech signals of a source speaker. Speech features generated by the source HMM-based speech features generator may be used for synthesis of speech. More particularly, speech features generally include quantitative measures of acoustic properties of speech that may be processed, for example by a speech synthesizer apparatus, to produce synthesized speech. The source HMM based speech features generator may include a configuration of source HMM state models, each of which has a set of generator-model functions. The source HMM state models are used to model states, and transitions between states, of phonetic units. The set of generator-model functions for each source HMM state model specifies how speech features corresponding to the modeled phonetic unit are generated.

At step 104, speech features are extracted from speech signals of a target speaker to generate a set of target-speaker vectors. The speech signals could be provided in real-time by the target speaker, or could be contained in voice recordings of the target speaker. More generally, the process of extracting speech features from speech signals is referred to herein as “feature extraction,” and may be considered as generating a parameterized representation of speech signals, the parameters or “features” being elements of “feature vectors.” The target-speaker vectors generated at step 104 can thus be considered feature vectors generated from speech of the target speaker. In accordance with example embodiments, feature extraction can entail decomposing the speech signals of a speaker (e.g., the target speaker in the current example method) into at least one of spectral envelopes, aperiodicity envelopes, fundamental frequencies, or voicing. Feature extraction could produce other types of features as well.
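By way of illustration, the decomposition described above can be approximated with off-the-shelf vocoder analysis. The following sketch assumes the third-party pyworld (WORLD vocoder) and soundfile packages; the per-frame vector layout is an assumption made here for illustration only:

    # pip install pyworld soundfile   (assumed third-party packages)
    import numpy as np
    import pyworld as pw
    import soundfile as sf

    def extract_target_vectors(wav_path):
        """Decompose a target-speaker recording into spectral envelopes,
        aperiodicity envelopes, F0, and a per-frame voicing flag."""
        x, fs = sf.read(wav_path)
        x = np.ascontiguousarray(x, dtype=np.float64)
        f0, t = pw.harvest(x, fs)               # F0 contour
        sp = pw.cheaptrick(x, f0, t, fs)        # spectral envelope per frame
        ap = pw.d4c(x, f0, t, fs)               # aperiodicity per frame
        voiced = (f0 > 0.0).astype(np.float64)  # voicing decision
        # One target-speaker vector per frame:
        return np.hstack([sp, ap, f0[:, None], voiced[:, None]])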

At step 106, a procedure for matching parameters of the generator-model functions of the source HMM state models and target-speaker vectors from among the target set is carried out. Specifically, for each given source HMM state model of the configuration, a particular target-speaker vector from among the target set that most closely matches the parameters of the set of generator-model functions of the given source HMM state model is determined. As described in more detail below, the matching is determined using a procedure that simultaneously applies a parametric, transformation-based mapping from an analytic space of the source HMM state models to an analytic vector space of the target set of target-speaker vectors, and a nonparametric, probabilistic-associative mapping from the analytic vector space of the target set of target-speaker vectors to the analytic space of the source HMM state models. Referred to herein as “matching under transform” (“MUT”), the matching procedure can compensate for differences between speech (e.g., voice) of the source speaker and speech (e.g., voice) of the target speaker.
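The description does not spell out the MUT computation at this point, but one plausible reading of the two simultaneous mappings is the alternating sketch below: a parametric step fits a linear transform from the source state means toward the target vector space, and a nonparametric step softly associates target vectors with the transformed means. This is an illustrative interpretation, not the claimed algorithm itself:

    import numpy as np

    def match_under_transform(src_means, tgt_vecs, n_iter=10):
        """Alternate between (i) a parametric, transformation-based mapping
        (estimate linear transform T by weighted least squares) and (ii) a
        nonparametric, probabilistic-associative mapping (soft assignment of
        target vectors to transformed source means). Returns, for each
        source state, the index of its most closely matching target vector.
        Assumes at least as many source states as feature dimensions."""
        d = src_means.shape[1]
        T = np.eye(d)
        for _ in range(n_iter):
            mapped = src_means @ T.T
            d2 = ((mapped[:, None, :] - tgt_vecs[None, :, :]) ** 2).sum(-1)
            w = np.exp(-0.5 * (d2 - d2.min(axis=1, keepdims=True)))
            w /= w.sum(axis=1, keepdims=True)   # responsibilities
            targets = w @ tgt_vecs              # expected match per state
            X, *_ = np.linalg.lstsq(src_means, targets, rcond=None)
            T = X.T
        d2 = (((src_means @ T.T)[:, None, :] - tgt_vecs[None, :, :]) ** 2).sum(-1)
        return d2.argmin(axis=1)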

At step 108, a fundamental frequency (F0) transform that speech-adapts F0 statistics of the source HMM based speech features generator to match F0 statistics of the target speaker is determined.
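A simple and common choice for such an F0 transform, shown here purely as an illustrative sketch, is mean/variance matching of log-F0 statistics; the source-side statistics would come from the F0 stream of the source HMM:

    import numpy as np

    def make_f0_transform(src_log_f0_mean, src_log_f0_std, tgt_f0, eps=1e-9):
        """Build a log-domain mean/variance matching transform from source
        F0 statistics to target F0 statistics (voiced frames only)."""
        v = np.log(tgt_f0[tgt_f0 > 0])
        tgt_mean, tgt_std = v.mean(), v.std()

        def transform(log_f0):
            z = (log_f0 - src_log_f0_mean) / (src_log_f0_std + eps)
            return tgt_mean + tgt_std * z

        return transform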

At step 110, a converted HMM based speech features generator is constructed for mapping between the target speaker and the source HMM based speech features generator. More particularly, the converted HMM based speech features generator, also implemented by one or more processors of the system, is constructed to initially be the same as the source HMM based speech features generator. Thus, the converted HMM based speech features generator initially has the source HMM state models and generator-model functions of the source HMM based speech features generator. Then, the parameters of the set of generator-model functions of each source HMM state model of the converted HMM based speech features generator are replaced with the determined particular most closely matching target-speaker vector from among the target set, as determined at step 106.
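Continuing the illustrative sketches above (and assuming the HMMStateModel structure and the match indices returned by the MUT sketch), the construction in step 110 might look like the following; the split of each target-speaker vector into spectral and excitation parts is an assumption:

    import copy

    def build_converted_generator(source_states, match_index, tgt_vecs):
        """Copy the source states, then overwrite each state's
        generator-model parameters with its matched target-speaker
        vector (step 110); F0 adaptation (step 112) is applied after."""
        converted = copy.deepcopy(source_states)
        for state, idx in zip(converted, match_index):
            vec = tgt_vecs[idx]
            k = state.spectral_mean.shape[0]
            m = state.excitation_mean.shape[0]
            state.spectral_mean = vec[:k]
            state.excitation_mean = vec[k:k + m]
        return converted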

Finally, at step 112, the F0 statistics of the converted HMM based speech features generator are speech-adapted using the F0 transform determined at step 108. The result may be referred to as a speech-adapted converted HMM based speech features generator.

In accordance with example embodiments, the source HMM based speech features generator and the converted HMM based speech features generator could be implemented by at least one common processor from among the one or more processors of the system. For example, both could be implemented by a single, common processor. Alternatively, either or both could be implemented in a distributed fashion, such that they share at least one common processor. As still a further alternative, they could be implemented without sharing any processor(s). Other implementations of the two HMM based speech features generators among configurations of the one or more processors of the system are possible as well.

In further accordance with example embodiments, the TTS synthesis system may be used to carry out run-time voice conversion to the voice of the target speaker. More particularly, operating in a run-time mode, the TTS synthesis system could create an enriched transcription of a run-time text string. Next, the speech-adapted converted HMM based speech features generator could be used to convert the enriched transcription into corresponding output speech features. Finally, a synthesized utterance of the enriched transcription could be generated using the output speech features. The synthesized utterance could thereby have voice characteristics of the target speaker.

In still further accordance with example embodiments, creating the enriched transcription of the run-time text string could entail receiving the run-time text string at the TTS synthesis system, and converting the received run-time text string into the enriched transcription of the run-time text string by the TTS synthesis system. As used herein, an enriched transcription is a symbolic representation of the phonetic and linguistic content of written text or other symbolic form of speech. It can take the form of a sequence or concatenation of labels (or other text-based identifiers), each label identifying a phonetic speech unit, such as a phoneme or triphone, and further identifying or encoding linguistic and/or syntactic context, temporal parameters, and other information for specifying how to render the symbolically-represented sounds as meaningful speech in a given language. Generating the synthesized utterance could correspond to synthesizing speech using the output speech features.
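As a purely hypothetical illustration of what such a label sequence might look like, the following uses an HTS-style full-context format, one common convention in HMM-based synthesis; the exact fields and the trailing context features (elided with “...”) are not specified by this description:

    # Hypothetical enriched transcription for the word "dog":
    enriched_transcription = [
        "sil^sil-d+ao=g@1_3/A:...",  # phoneme /d/ with phonetic context
        "sil^d-ao+g=sil@2_3/A:...",  # phoneme /ao/
        "d^ao-g+sil=sil@3_3/A:...",  # phoneme /g/
    ]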

More specifically, taking the enriched transcription to be observed data, the converted HMM based speech features generator can be used to model speech features corresponding to the received text string. The modeled speech features can serve as input to a speech synthesizer, such as a vocoder, in order to generate synthesized speech. By preparing the converted HMM based speech features generator as described above, the synthesized output voice is made to sound like that of the target speaker. The effect is to render text entered or supplied by the target speaker into synthesized speech that sounds like the target speaker's voice.

In accordance with example embodiments, the set of generator-model functions for each given source HMM state model could include a multivariate spectral probability density function (PDF) for jointly modeling spectral envelope parameters of a phonetic unit modeled by a given source HMM state model, and a multivariate excitation PDF for jointly modeling excitation parameters of the phonetic unit. By way of example, phonetic speech units could be phonemes and/or triphones.

With generator-model functions defined as multivariate PDFs, the matching procedure of step 106 may be described in terms of finding target-speaker vectors that differ minimally from parameters of the multivariate PDFs. More specifically, making a determination of a particular target-speaker vector from among the target set that most closely matches parameters of the set of generator-model functions of the given source HMM state model could entail determining a target-speaker vector from among the target set that is computationally nearest to parameters of the multivariate spectral PDF of the given source HMM state model in terms of a distance criterion that could be based on mean-squared error (MSE) or the Kullback-Leibler distance. Making the determination could additionally entail determining a target-speaker vector from among the target set that is computationally nearest to the multivariate excitation PDF of the given source HMM state model in terms of a distance criterion that could be based on MSE or the Kullback-Leibler distance.
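For concreteness, the two distance criteria might be computed as in the following sketch, where each candidate target-speaker vector is treated as the mean of a Gaussian with a shared variance so that a Kullback-Leibler distance is defined; that convention, and all names, are illustrative assumptions:

    import numpy as np

    def kl_diag_gauss(mu0, var0, mu1, var1):
        """KL(N0 || N1) for diagonal-covariance Gaussians."""
        return 0.5 * np.sum(np.log(var1 / var0)
                            + (var0 + (mu0 - mu1) ** 2) / var1 - 1.0)

    def nearest_target_vector(state_mu, state_var, tgt_vecs, shared_var):
        """Return the index of the computationally nearest target vector
        under each criterion: mean-squared error and KL distance."""
        mse = ((tgt_vecs - state_mu) ** 2).mean(axis=1)
        kl = np.array([kl_diag_gauss(state_mu, state_var, v, shared_var)
                       for v in tgt_vecs])
        return int(mse.argmin()), int(kl.argmin())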

In further accordance with example embodiments, determining the particular target-speaker vector from among the target set that most closely matches parameters of the set of generator-model functions of the given source HMM state model could entail making a determination of an optimal correspondence between a multivariate PDF of the given source HMM state model and a particular target-speaker vector from among the target set. As mentioned above and discussed in more detail below, the determination could be made under a transform that compensates for differences between speech of the source speaker and speech of the target speaker. That is, a matching under transform technique could be used to make an optimal matching determination.

In further accordance with example embodiments, the multivariate spectral PDF of each source HMM state model could have the mathematical form of a multivariate Gaussian function. While generator-model functions of HMMs can take the form of Gaussian PDFs, this is not necessarily a requirement.

In further accordance with example embodiments, the spectral envelope parameters of the phonetic units could be Mel Cepstral coefficients, Line Spectral Pairs, Linear Predictive coefficients, Mel-Generalized Cepstral Coefficients, or other acoustic-related quantities. In addition, the spectral envelope parameters of the phonetic units could also include first and second time derivatives of the acoustic-related quantities. As noted above, extraction of features from speech signals of the target speaker can entail decomposing the speech signals of a speaker (e.g., the target speaker in the current example method) into at least one of spectral envelopes, aperiodicity envelopes, fundamental frequencies, or voicing.
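The first and second time derivatives mentioned above are typically computed frame-by-frame from the static coefficients; the following sketch uses a simple two-frame regression window, which is one common choice rather than a prescribed one:

    import numpy as np

    def append_deltas(c):
        """Given per-frame static coefficients c (frames x dims), append
        first (delta) and second (delta-delta) time derivatives."""
        pad = np.pad(c, ((1, 1), (0, 0)), mode="edge")
        delta = 0.5 * (pad[2:] - pad[:-2])
        pad2 = np.pad(delta, ((1, 1), (0, 0)), mode="edge")
        delta2 = 0.5 * (pad2[2:] - pad2[:-2])
        return np.hstack([c, delta, delta2])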

In accordance with example embodiments, construction of the converted HMM based speech features generator at step 110 could entail transforming the source HMM based speech features generator into the converted HMM based speech features generator. More particularly, the parameters of the set of generator-model functions of each source HMM state model of the source HMM based speech features generator could be replaced with the particular most closely matching target-speaker vector from among the target set, as determined at step 106. The F0 statistics of the transformed source HMM based speech features generator could then be speech-adapted using the F0 transform determined at step 108. In this approach, the converted HMM based speech features generator can be viewed as being constructed “in place” from the source HMM based speech features generator.

FIG. 2 is a flowchart illustrating an alternative example method in accordance with example embodiments. At step 202, a source HMM based speech features generator is implemented by one or more processors of a TTS synthesis system. The implemented source HMM based speech features generator can include a configuration of source HMM state models, and each of the source HMM state models may have a set of generator-model functions. Further, the implemented source HMM based speech features generator is trained using speech signals of a source speaker.

At step 204, a set of target-speaker vectors is provided. More particularly, the set of target-speaker vectors can be generated from speech features extracted from speech signals of the target speaker. As described above, the target-speaker vectors (or, more generally, feature vectors) can be the product of feature extraction, and may be considered to be a parameterized representation of speech signals. Again, feature extraction can entail decomposing the speech signals of a speaker (e.g., the target speaker in the current example method) into at least one of spectral envelopes, aperiodicity envelopes, fundamental frequencies, or voicing. And feature extraction could produce other types of features as well.

At step 206, a converted HMM based speech features generator is implemented that is the same as the source HMM based speech features generator, but with some specific differences. In particular, the parameters of the set of generator-model functions of each given source HMM state model of the converted HMM based speech features generator are replaced with a particular target-speaker vector from among the target set that most closely matches the parameters of the set of generator-model functions of the given source HMM state model. In addition, fundamental frequency (F0) statistics of the converted HMM based speech features generator are speech-adapted using an F0 transform that speech-adapts F0 statistics of the source HMM based speech features generator to match F0 statistics of the target speaker.

At step 208, the TTS synthesis system could receive an enriched transcription of a run-time text string. The run-time text string could be received at an input device of the TTS synthesis system, and the enriched transcription could include a sequence or concatenation of labels (or other text-based identifiers). As described above, each label could identify a phonetic speech unit, such as a phoneme or triphone, and further identify or encode linguistic and/or syntactic context, temporal parameters, and other information for specifying how to render the symbolically-represented sounds as meaningful speech in a given language.

At step 210, the TTS synthesis system could use the converted HMM based speech features generator to convert the enriched transcription into corresponding output speech features. As described above, taking the enriched transcription to be observed data, the converted HMM based speech features generator can be used to model speech features corresponding to the received text string.

Finally, at step 212, a synthesized utterance of the enriched transcription could be generated using the output speech features. The synthesized utterance could thereby have voice characteristics of the target speaker.

It will be appreciated that the steps shown in FIGS. 1 and 2 are meant to illustrate methods in accordance with example embodiments. As such, various steps could be altered or modified, the ordering of certain steps could be changed, and additional steps could be added, while still achieving the overall desired operation.

3. Example Communication System and Device Architecture

Methods in accordance with an example embodiment, such as the ones described above, could be implemented using so-called “thin clients” and “cloud-based” server devices, as well as other types of client and server devices. Under various aspects of this paradigm, client devices, such as mobile phones and tablet computers, may offload some processing and storage responsibilities to remote server devices. At least some of the time, these client devices are able to communicate, via a network such as the Internet, with the server devices. As a result, applications that operate on the client devices may also have a persistent, server-based component. Nonetheless, it should be noted that at least some of the methods, processes, and techniques disclosed herein may be able to operate entirely on a client device or a server device.

This section describes general system and device architectures for such client devices and server devices. However, the methods, devices, and systems presented in the subsequent sections may operate under different paradigms as well. Thus, the embodiments of this section are merely examples of how these methods, devices, and systems can be enabled.

a. Example Communication System

FIG. 3 is a simplified block diagram of a communication system 300, in which various embodiments described herein can be employed. Communication system 300 includes client devices 302, 304, and 306, which represent a desktop personal computer (PC), a tablet computer, and a mobile phone, respectively. Client devices could also include wearable computing devices, such as head-mounted displays and/or augmented reality displays, for example. Each of these client devices may be able to communicate with other devices (including with each other) via a network 308 through the use of wireline connections (designated by solid lines) and/or wireless connections (designated by dashed lines).

Network 308 may be, for example, the Internet, or some other form of public or private Internet Protocol (IP) network. Thus, client devices 302, 304, and 306 may communicate using packet-switching technologies. Nonetheless, network 308 may also incorporate at least some circuit-switching technologies, and client devices 302, 304, and 306 may communicate via circuit switching alternatively or in addition to packet switching.

A server device 310 may also communicate via network 308. In particular, server device 310 may communicate with client devices 302, 304, and 306 according to one or more network protocols and/or application-level protocols to facilitate the use of network-based or cloud-based computing on these client devices. Server device 310 may include integrated data storage (e.g., memory, disk drives, etc.) and may also be able to access a separate server data storage 312. Communication between server device 310 and server data storage 312 may be direct, via network 308, or both direct and via network 308 as illustrated in FIG. 3. Server data storage 312 may store application data that is used to facilitate the operations of applications performed by client devices 302, 304, and 306 and server device 310.

Although only three client devices, one server device, and one server data storage are shown in FIG. 3, communication system 300 may include any number of each of these components. For instance, communication system 300 may comprise millions of client devices, thousands of server devices and/or thousands of server data storages. Furthermore, client devices may take on forms other than those in FIG. 3.

b. Example Server Device and Server System

FIG. 4A is a block diagram of a server device in accordance with an example embodiment. In particular, server device 400 shown in FIG. 4A can be configured to perform one or more functions of server device 310 and/or server data storage 312. Server device 400 may include a user interface 402, a communication interface 404, processor 406, and data storage 408, all of which may be linked together via a system bus, network, or other connection mechanism 414.

User interface 402 may comprise user input devices such as a keyboard, a keypad, a touch screen, a computer mouse, a track ball, a joystick, and/or other similar devices, now known or later developed. User interface 402 may also comprise user display devices, such as one or more cathode ray tubes (CRT), liquid crystal displays (LCD), light emitting diodes (LEDs), displays using digital light processing (DLP) technology, printers, light bulbs, and/or other similar devices, now known or later developed. Additionally, user interface 402 may be configured to generate audible output(s), via a speaker, speaker jack, audio output port, audio output device, earphones, and/or other similar devices, now known or later developed. In some embodiments, user interface 402 may include software, circuitry, or another form of logic that can transmit data to and/or receive data from external user input/output devices.

Communication interface 404 may include one or more wireless interfaces and/or wireline interfaces that are configurable to communicate via a network, such as network 308 shown in FIG. 3. The wireless interfaces, if present, may include one or more wireless transceivers, such as a BLUETOOTH® transceiver, a Wifi transceiver perhaps operating in accordance with an IEEE 802.11 standard (e.g., 802.11b, 802.11g, 802.11n), a WiMAX transceiver perhaps operating in accordance with an IEEE 802.16 standard, a Long-Term Evolution (LTE) transceiver perhaps operating in accordance with a 3rd Generation Partnership Project (3GPP) standard, and/or other types of wireless transceivers configurable to communicate via local-area or wide-area wireless networks. The wireline interfaces, if present, may include one or more wireline transceivers, such as an Ethernet transceiver, a Universal Serial Bus (USB) transceiver, or similar transceiver configurable to communicate via a twisted pair wire, a coaxial cable, a fiber-optic link or other physical connection to a wireline device or network.

In some embodiments, communication interface 404 may be configured to provide reliable, secured, and/or authenticated communications. For each communication described herein, information for ensuring reliable communications (e.g., guaranteed message delivery) can be provided, perhaps as part of a message header and/or footer (e.g., packet/message sequencing information, encapsulation header(s) and/or footer(s), size/time information, and transmission verification information such as cyclic redundancy check (CRC) and/or parity check values). Communications can be made secure (e.g., be encoded or encrypted) and/or decrypted/decoded using one or more cryptographic protocols and/or algorithms, such as, but not limited to, the data encryption standard (DES), the advanced encryption standard (AES), the Rivest, Shamir, and Adleman (RSA) algorithm, the Diffie-Hellman algorithm, and/or the Digital Signature Algorithm (DSA). Other cryptographic protocols and/or algorithms may be used instead of or in addition to those listed herein to secure (and then decrypt/decode) communications.

Processor 406 may include one or more general purpose processors (e.g., microprocessors) and/or one or more special purpose processors (e.g., digital signal processors (DSPs), graphical processing units (GPUs), floating point processing units (FPUs), network processors, or application specific integrated circuits (ASICs)). Processor 406 may be configured to execute computer-readable program instructions 410 that are contained in data storage 408, and/or other instructions, to carry out various functions described herein.

Data storage 408 may include one or more non-transitory computer-readable storage media that can be read or accessed by processor 406. The one or more computer-readable storage media may include volatile and/or non-volatile storage components, such as optical, magnetic, organic or other memory or disc storage, which can be integrated in whole or in part with processor 406. In some embodiments, data storage 408 may be implemented using a single physical device (e.g., one optical, magnetic, organic or other memory or disc storage unit), while in other embodiments, data storage 408 may be implemented using two or more physical devices.

Data storage 408 may also include program data 412 that can be used by processor 406 to carry out functions described herein. In some embodiments, data storage 408 may include, or have access to, additional data storage components or devices (e.g., cluster data storages described below).

Referring again briefly to FIG. 3, server device 310 and server data storage device 312 may store applications and application data at one or more locales accessible via network 308. These locales may be data centers containing numerous servers and storage devices. The exact physical location, connectivity, and configuration of server device 310 and server data storage device 312 may be unknown and/or unimportant to client devices. Accordingly, server device 310 and server data storage device 312 may be referred to as “cloud-based” devices that are housed at various remote locations. One possible advantage of such “cloud-based” computing is to offload processing and data storage from client devices, thereby simplifying the design and requirements of these client devices.

In some embodiments, server device 310 and server data storage device 312 may be a single computing device residing in a single data center. In other embodiments, server device 310 and server data storage device 312 may include multiple computing devices in a data center, or even multiple computing devices in multiple data centers, where the data centers are located in diverse geographic locations. For example, FIG. 3 depicts each of server device 310 and server data storage device 312 potentially residing in a different physical location.

FIG. 4B depicts an example of a cloud-based server cluster. In FIG. 4B, functions of server device 310 and server data storage device 312 may be distributed among three server clusters 420A, 420B, and 420C. Server cluster 420A may include one or more server devices 400A, cluster data storage 422A, and cluster routers 424A connected by a local cluster network 426A. Similarly, server cluster 420B may include one or more server devices 400B, cluster data storage 422B, and cluster routers 424B connected by a local cluster network 426B. Likewise, server cluster 420C may include one or more server devices 400C, cluster data storage 422C, and cluster routers 424C connected by a local cluster network 426C. Server clusters 420A, 420B, and 420C may communicate with network 308 via communication links 428A, 428B, and 428C, respectively.

In some embodiments, each of the server clusters 420A, 420B, and 420C may have an equal number of server devices, an equal number of cluster data storages, and an equal number of cluster routers. In other embodiments, however, some or all of the server clusters 420A, 420B, and 420C may have different numbers of server devices, different numbers of cluster data storages, and/or different numbers of cluster routers. The number of server devices, cluster data storages, and cluster routers in each server cluster may depend on the computing task(s) and/or applications assigned to each server cluster.

In the server cluster 420A, for example, server devices 400A can be configured to perform various computing tasks of a server, such as server device 310. In one embodiment, these computing tasks can be distributed among one or more of server devices 400A. Server devices 400B and 400C in server clusters 420B and 420C may be configured the same or similarly to server devices 400A in server cluster 420A. On the other hand, in some embodiments, server devices 400A, 400B, and 400C each may be configured to perform different functions. For example, server devices 400A may be configured to perform one or more functions of server device 310, and server devices 400B and server device 400C may be configured to perform functions of one or more other server devices. Similarly, the functions of server data storage device 312 can be dedicated to a single server cluster, or spread across multiple server clusters.

Cluster data storages 422A, 422B, and 422C of the server clusters 420A, 420B, and 420C, respectively, may be data storage arrays that include disk array controllers configured to manage read and write access to groups of hard disk drives. The disk array controllers, alone or in conjunction with their respective server devices, may also be configured to manage backup or redundant copies of the data stored in cluster data storages to protect against disk drive failures or other types of failures that prevent one or more server devices from accessing one or more cluster data storages.

Similar to the manner in which the functions of server device 310 and server data storage device 312 can be distributed across server clusters 420A, 420B, and 420C, various active portions and/or backup/redundant portions of these components can be distributed across cluster data storages 422A, 422B, and 422C. For example, some cluster data storages 422A, 422B, and 422C may be configured to store backup versions of data stored in other cluster data storages 422A, 422B, and 422C.

Cluster routers 424A, 424B, and 424C in server clusters 420A, 420B, and 420C, respectively, may include networking equipment configured to provide internal and external communications for the server clusters. For example, cluster routers 424A in server cluster 420A may include one or more packet-switching and/or routing devices configured to provide (i) network communications between server devices 400A and cluster data storage 422A via cluster network 426A, and/or (ii) network communications between the server cluster 420A and other devices via communication link 428A to network 308. Cluster routers 424B and 424C may include network equipment similar to cluster routers 424A, and cluster routers 424B and 424C may perform networking functions for server clusters 420B and 420C that cluster routers 424A perform for server cluster 420A.

Additionally, the configuration of cluster routers 424A, 424B, and 424C can be based at least in part on the data communication requirements of the server devices and cluster storage arrays, the data communications capabilities of the network equipment in the cluster routers 424A, 424B, and 424C, the latency and throughput of the local cluster networks 426A, 426B, 426C, the latency, throughput, and cost of the wide area network connections 428A, 428B, and 428C, and/or other factors that may contribute to the cost, speed, fault-tolerance, resiliency, efficiency, and/or other design goals of the system architecture.

c. Example Client Device

FIG. 5 is a simplified block diagram showing some of the components of an example client device 500. By way of example and without limitation, client device 500 may be or include a “plain old telephone system” (POTS) telephone, a cellular mobile telephone, a still camera, a video camera, a fax machine, an answering machine, a computer (such as a desktop, notebook, or tablet computer), a personal digital assistant, a wearable computing device, a home automation component, a digital video recorder (DVR), a digital TV, a remote control, or some other type of device equipped with one or more wireless or wired communication interfaces.

As shown in FIG. 5, client device 500 may include a communication interface 502, a user interface 504, a processor 506, and data storage 508, all of which may be communicatively linked together by a system bus, network, or other connection mechanism 510.

Communication interface 502 functions to allow client device 500 to communicate, using analog or digital modulation, with other devices, access networks, and/or transport networks. Thus, communication interface 502 may facilitate circuit-switched and/or packet-switched communication, such as POTS communication and/or IP or other packetized communication. For instance, communication interface 502 may include a chipset and antenna arranged for wireless communication with a radio access network or an access point. Also, communication interface 502 may take the form of a wireline interface, such as an Ethernet, Token Ring, or USB port. Communication interface 502 may also take the form of a wireless interface, such as a Wifi, BLUETOOTH®, global positioning system (GPS), or wide-area wireless interface (e.g., WiMAX or LTE). However, other forms of physical layer interfaces and other types of standard or proprietary communication protocols may be used over communication interface 502. Furthermore, communication interface 502 may comprise multiple physical communication interfaces (e.g., a Wifi interface, a BLUETOOTH® interface, and a wide-area wireless interface).

User interface 504 may function to allow client device 500 to interact with a human or non-human user, such as to receive input from a user and to provide output to the user. Thus, user interface 504 may include input components such as a keypad, keyboard, touch-sensitive or presence-sensitive panel, computer mouse, trackball, joystick, microphone, still camera and/or video camera. User interface 504 may also include one or more output components such as a display screen (which, for example, may be combined with a touch-sensitive panel), CRT, LCD, LED, a display using DLP technology, printer, light bulb, and/or other similar devices, now known or later developed. User interface 504 may also be configured to generate audible output(s), via a speaker, speaker jack, audio output port, audio output device, earphones, and/or other similar devices, now known or later developed. In some embodiments, user interface 504 may include software, circuitry, or another form of logic that can transmit data to and/or receive data from external user input/output devices. Additionally or alternatively, client device 500 may support remote access from another device, via communication interface 502 or via another physical interface (not shown).

Processor 506 may comprise one or more general purpose processors (e.g., microprocessors) and/or one or more special purpose processors (e.g., DSPs, GPUs, FPUs, network processors, or ASICs). Data storage 508 may include one or more volatile and/or non-volatile storage components, such as magnetic, optical, flash, or organic storage, and may be integrated in whole or in part with processor 506. Data storage 508 may include removable and/or non-removable components.

In general, processor 506 may be capable of executing program instructions 518 (e.g., compiled or non-compiled program logic and/or machine code) stored in data storage 508 to carry out the various functions described herein. Data storage 508 may include a non-transitory computer-readable medium, having stored thereon program instructions that, upon execution by client device 500, cause client device 500 to carry out any of the methods, processes, or functions disclosed in this specification and/or the accompanying drawings. The execution of program instructions 518 by processor 506 may result in processor 506 using data 512.

By way of example, program instructions 518 may include an operating system 522 (e.g., an operating system kernel, device driver(s), and/or other modules) and one or more application programs 520 (e.g., address book, email, web browsing, social networking, and/or gaming applications) installed on client device 500. Similarly, data 512 may include operating system data 516 and application data 514. Operating system data 516 may be accessible primarily to operating system 522, and application data 514 may be accessible primarily to one or more of application programs 520. Application data 514 may be arranged in a file system that is visible to or hidden from a user of client device 500.

Application programs 520 may communicate with operating system 522 through one or more application programming interfaces (APIs). These APIs may facilitate, for instance, application programs 520 reading and/or writing application data 514, transmitting or receiving information via communication interface 502, receiving or displaying information on user interface 504, and so on.

In some vernaculars, application programs 520 may be referred to as “apps” for short. Additionally, application programs 520 may be downloadable to client device 500 through one or more online application stores or application markets. However, application programs can also be installed on client device 500 in other ways, such as via a web browser or through a physical interface (e.g., a USB port) on client device 500.

4. Example System and Operation

a. Example Text-to-Speech System

A TTS synthesis system (or more generally, a speech synthesis system) may operate by receiving an input text string, processing the text string into a symbolic representation of the phonetic and linguistic content of the text string, generating a sequence of speech features corresponding to the symbolic representation, and providing the speech features as input to a speech synthesizer in order to produce a spoken rendering of the input text string. The symbolic representation of the phonetic and linguistic content of the text string may take the form of a sequence of labels, each label identifying a phonetic speech unit, such as a phoneme, and further identifying or encoding linguistic and/or syntactic context, temporal parameters, and other information for specifying how to render the symbolically-represented sounds as meaningful speech in a given language. While the term “phonetic transcription” is sometimes used to refer to such a symbolic representation of text, the term “enriched transcription” introduced above will instead be used herein, in order to signify inclusion of extra-phonetic content, such as linguistic and/or syntactic context and temporal parameters, represented in the sequence of labels.

The enriched transcription provides a symbolic representation of the phonetic and linguistic content of the text string as rendered speech, and can be represented as a sequence of phonetic speech units identified according to labels, which could further identify or encode linguistic and/or syntactic context, temporal parameters, and other information for specifying how to render the symbolically-represented sounds as meaningful speech in a given language. As discussed above, the phonetic speech units could be phonemes. A phoneme may be considered to be the smallest segment of speech of a given language that encompasses a meaningful contrast with other speech segments of the given language. Thus, a word typically includes one or more phonemes. For purposes of simplicity, phonemes may be thought of as utterances of letters, although this is not a perfect analogy, as some phonemes may represent multiple letters. As an example, the phonemic spelling for the American English pronunciation of the word “cat” is /k/ /ae/ /t/, and consists of the phonemes /k/, /ae/, and /t/. As another example, the phonemic spelling for the word “dog” is /d/ /aw/ /g/, consisting of the phonemes /d/, /aw/, and /g/. Different phonemic alphabets exist, and other phonemic representations are possible. Common phonemic alphabets for American English contain about 40 distinct phonemes. Other languages may be described by different phonemic alphabets containing different phonemes.

The phonetic properties of a phoneme in an utterance can depend on, or be influenced by, the context in which it is (or is intended to be) spoken. For example, a “triphone” is a triplet of phonemes in which the spoken rendering of a given phoneme is shaped by a temporally-preceding phoneme, referred to as the “left context,” and a temporally-subsequent phoneme, referred to as the “right context.” Thus, the ordering of the phonemes of English-language triphones corresponds to the direction in which English is read. Other phoneme contexts, such as quinphones, may be considered as well.
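As a non-limiting illustration of how triphone labels could be derived from a phoneme sequence, the following Python sketch applies one common labeling convention; the “sil” (silence) padding at utterance boundaries and the “left-center+right” label format are assumptions of the example, not requirements of the embodiments herein.

    # Illustrative sketch: derive "left-center+right" triphone labels from
    # a phoneme sequence, padding utterance boundaries with "sil".
    def to_triphones(phonemes):
        padded = ["sil"] + list(phonemes) + ["sil"]
        return [
            f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
            for i in range(1, len(padded) - 1)
        ]

    # The word "cat" (/k/ /ae/ /t/) yields:
    # ['sil-k+ae', 'k-ae+t', 'ae-t+sil']
    print(to_triphones(["k", "ae", "t"]))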

Speech features represent acoustic properties of speech as parameters, and in the context of speech synthesis, may be used for driving generation of a synthesized waveform corresponding to an output speech signal. Generally, features for speech synthesis account for three major components of speech signals, namely spectral envelopes that resemble the effect of the vocal tract, excitation that simulates the glottal source, and prosody that describes pitch contour (“melody”) and tempo (rhythm). In practice, features may be represented in multidimensional feature vectors that correspond to one or more temporal frames. One of the basic operations of a TTS synthesis system is to map an enriched transcription (e.g., a sequence of labels) to an appropriate sequence of feature vectors.

In the context of speech recognition, features may be extracted from a speech signal (e.g., a voice recording) in a process that typically involves sampling and quantizing an input speech utterance within sequential temporal frames, and performing spectral analysis of the data in the frames to derive a vector of features associated with each frame. Each feature vector can thus be viewed as providing a snapshot of the temporal evolution of the speech utterance.

By way of example, the features may include Mel Filter Cepstral (MFC) coefficients. MFC coefficients may represent the short-term power spectrum of a portion of an input utterance, and may be based on, for example, a linear cosine transform of a log power spectrum on a nonlinear Mel scale of frequency. (A Mel scale may be a scale of pitches subjectively perceived by listeners to be about equally distant from one another, even though the actual frequencies of these pitches are not equally distant from one another.)

In some embodiments, a feature vector may include MFC coefficients, first-order cepstral coefficient derivatives, and second-order cepstral coefficient derivatives. For example, the feature vector may contain 13 coefficients, 13 first-order derivatives (“delta”), and 13 second-order derivatives (“delta-delta”), therefore having a length of 39. However, feature vectors may use different combinations of features in other possible embodiments. As another example, feature vectors could include Perceptual Linear Predictive (PLP) coefficients, Relative Spectral (RASTA) coefficients, Filterbank log-energy coefficients, or some combination thereof. Each feature vector may be thought of as including a quantified characterization of the acoustic content of a corresponding temporal frame of the utterance (or more generally of an audio input signal).
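To make the composition of such a 39-dimensional feature vector concrete, the following Python sketch appends first- and second-order differences to 13 static coefficients per frame; the use of simple frame differencing (rather than, say, regression-based delta windows) is an assumption of the example.

    import numpy as np

    def add_deltas(static):
        """static: (num_frames, 13) -> (num_frames, 39) feature matrix."""
        delta = np.gradient(static, axis=0)    # first-order ("delta")
        delta2 = np.gradient(delta, axis=0)    # second-order ("delta-delta")
        return np.concatenate([static, delta, delta2], axis=1)

    frames = np.random.randn(100, 13)          # placeholder static features
    assert add_deltas(frames).shape == (100, 39)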

In accordance with example embodiments of HMM-based speech synthesis, a sequence of labels corresponding to enriched transcription of the input text may be treated as observed data, and a sequence of HMMs and HMM states is computed so as to maximize a joint probability of generating the observed enriched transcription. The labels of the enriched transcription sequence may identify phonemes, triphones, and/or other phonetic speech units. In some HMM-based techniques, phonemes and/or triphones are represented by HMMs as having three states corresponding to three temporal phases, namely beginning, middle, and end. Other HMMs with a different number of states per phoneme (or triphone, for example) could be used as well. In addition, the enriched transcription may also include additional information about the input text string, such as time or duration models for the phonetic speech units, linguistic context, and other indicators that may characterize how the output speech should sound, for example.

In accordance with example embodiments, speech features corresponding to HMMs and HMM states may be represented by multivariate PDFs for jointly modeling the different features that make up the feature vectors. In particular, multivariate Gaussian PDFs can be used to compute probabilities of a given state emitting or generating multiple dimensions of features from a given state of the model. Each dimension of a given multivariate Gaussian PDF could thus correspond to a different feature. It is also possible to model a feature along a given dimension with more than one Gaussian PDF in that dimension. In such an approach, the feature is said to be modeled by a mixture of Gaussians, referred to as a “Gaussian mixture model” or “GMM.” The sequence of features generated by the most probable sequence of HMMs and HMM states can be converted to speech by a speech synthesizer, for example.
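The emission probability computation described above may be illustrated with the following Python sketch, which evaluates the log probability of a feature vector under a diagonal-covariance GMM for one state; the diagonal covariance and the parameter shapes are simplifying assumptions of the example.

    import numpy as np

    def log_gmm_emission(x, weights, means, variances):
        """x: (D,); weights: (M,); means, variances: (M, D)."""
        # Per-component log density of a diagonal-covariance Gaussian.
        log_norm = -0.5 * np.sum(np.log(2 * np.pi * variances), axis=1)
        log_quad = -0.5 * np.sum((x - means) ** 2 / variances, axis=1)
        log_comp = np.log(weights) + log_norm + log_quad
        # Log-sum-exp over the M mixture components.
        m = np.max(log_comp)
        return m + np.log(np.sum(np.exp(log_comp - m)))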

FIG. 6 depicts a simplified block diagram of an example text-to-speech (TTS) synthesis system, in accordance with an example embodiment. In addition to functional components, FIG. 6 also shows selected example inputs, outputs, and intermediate products of example operation. The functional components of the TTS synthesis system 600 include a text analysis module 602 for converting text into an enriched transcription, and a TTS subsystem 604, including a source HMM, for generating synthesized speech from the enriched transcription. These functional components could be implemented as machine-language instructions in a centralized and/or distributed fashion on one or more computing platforms or systems, such as those described above. The machine-language instructions could be stored in one or another form of a tangible, non-transitory computer-readable medium (or other article of manufacture), such as magnetic or optical disk, or the like, and made available to processing elements of the system as part of a manufacturing procedure, configuration procedure, and/or execution start-up procedure, for example.

It should be noted that the discussion in this section, and the accompanying figures, are presented for purposes of example. Other TTS system arrangements, including different components, different relationships between the components, and/or different processing, may be possible.

In accordance with example embodiments, the text analysis module 602 may receive an input text string 601 (or other form of text-based input) and generate an enriched transcription 603 as output. The input text string 601 could be a text message, email, chat input, or other text-based communication, for example. As described above, the enriched transcription could correspond to a sequence of labels that identify speech units, including context information.

As shown, the TTS subsystem 604 may employ HMM-based speech synthesis to generate feature vectors corresponding to the enriched transcription 603. This is illustrated in FIG. 6 by a symbolic depiction of a source HMM in the TTS subsystem 604. The source HMM is represented by a configuration of speech-unit HMMs, each corresponding to a phonetic speech unit of the input language. The phonetic units could be phonemes or triphones, for example. Each speech-unit HMM is drawn as a set of circles, each representing a state of the speech unit, and arrows connecting the circles, each arrow representing a state transition. A circular arrow at each state represents a self-transition. Above each circle is a symbolic representation of a PDF. In the HMM methodology, the PDF specifies the probability that a given state will “emit” or generate speech features corresponding to the speech unit modeled by the state. The depiction in the figure of three states per speech-unit HMM is consistent with some HMM techniques that model three states for each speech unit. However, HMM techniques using different numbers of states per speech unit may be employed as well, and the illustrative use of three states in FIG. 6 (as well as in other figures herein) is not intended to be limiting with respect to example embodiments described herein. Further details of an example TTS synthesis system are described below.

In the example of FIG. 6, the TTS subsystem 604 outputs synthesized speech 605 in a voice of a source speaker. The source speaker could be a speaker used to train the source HMM.

In further accordance with example embodiments, the HMMs of a HMM-based TTS synthesis system may be trained by tuning the PDF parameters, using a database of recorded speech and corresponding known text strings.

FIG. 7 is a block diagram depicting additional details of an example text-to-speech synthesis system, in accordance with an example embodiment. As with the illustration in FIG. 6, FIG. 7 also displays functional components and selected example inputs, outputs, and intermediate products of example operation. The functional components of the speech synthesis system 700 include a text analysis module 702, a HMM module 704 that includes HMM parameters 706, a speech synthesizer module 708, a speech database 710, a feature extraction module 712, and a HMM training module 714. These functional components could be implemented as machine-language instructions in a centralized and/or distributed fashion on one or more computing platforms or systems, such as those described above. The machine-language instructions could be stored in one or another form of a tangible, non-transitory computer-readable medium (or other article of manufacture), such as magnetic or optical disk, or the like, and made available to processing elements of the system as part of a manufacturing procedure, configuration procedure, and/or execution start-up procedure, for example.

For purposes of illustration, FIG. 7 is depicted in a way that represents two operational modes: training-time and run-time. A thick, horizontal line marks a conceptual boundary between these two modes, with “Training-Time” labeling a portion of FIG. 7 above the line, and “Run-Time” labeling a portion below the line. As a visual cue, various arrows in the figure signifying information and/or processing flow and/or transmission are shown as dashed lines in the “Training-Time” portion of the figure, and as solid lines in the “Run-Time” portion.

During training, a training-time text string 701 from the speech database 710 may be input to the text analysis module 702, which then generates training-time labels 705 (an enriched transcription of the training-time text string 701). Each training-time label could be made up of a phonetic label identifying a phonetic speech unit (e.g., a phoneme), context information (e.g., one or more left-context and right-context phoneme labels, physical speech production characteristics, linguistic context, etc.), and timing information, such as a duration, relative timing position, and/or phonetic state model.

The training-time labels 705 are then input to the HMM module 704, which models training-time predicted spectral parameters 711 and training-time predicted excitation parameters 713. These may be considered speech features that are generated by the HMM module according to state transition probabilities and state emission probabilities that make up (at least in part) the HMM parameters. The training-time predicted spectral parameters 711 and training-time predicted excitation parameters 713 are then input to the HMM training module 714, as shown.

In further accordance with example embodiments, during training a training-time speech signal 703 from the speech database 710 is input to the feature extraction module 712, which processes the input signal to generate expected spectral parameters 707 and expected excitation parameters 709. The training-time speech signal 703 is predetermined to correspond to the training-time text string 701; this is signified by a wavy, dashed double arrow between the training-time speech signal 703 and the training-time text string 701. In practice, the training-time speech signal 703 could be a speech recording of a speaker reading the training-time text string 701. More specifically, the corpus of training data in the speech database 710 could include numerous recordings of one or more speakers reading numerous text strings. The expected spectral parameters 707 and expected excitation parameters 709 may be considered known parameters, since they are derived from a known speech signal.

During training time, the expected spectral parameters 707 and expected excitation parameters 709 are provided as input to the HMM training module 714. By comparing the training-time predicted spectral parameters 711 and training-time predicted excitation parameters 713 with the expected spectral parameters 707 and expected excitation parameters 709, the HMM training module 714 can determine how to adjust the HMM parameters 706 so as to achieve closest or optimal agreement between the predicted results and the known results. While this conceptual illustration of HMM training may appear suggestive of a feedback loop for error reduction, the procedure could entail a maximum likelihood (ML) adjustment of the HMM parameters. This is indicated by the return of ML-adjusted HMM parameters 715 from the HMM training module 714 to the HMM parameters 706. In practice, the training procedure may involve many iterations over many different speech samples and corresponding text strings in order to cover all (or most) of the phonetic speech units of the language of the TTS speech synthesis system 700 with sufficient data to determine accurate parameter values.

During run-time operation, illustrated in the lower portion of FIG. 7 (below the thick horizontal line), a run-time text string 717 is input to the text analysis module 702, which then generates run-time labels 719 (an enriched transcription of the run-time text string 717). The form of the run-time labels 719 may be the same as that for the training-time labels 705. The run-time labels 719 are then input to the HMM module 704, which generates run-time predicted spectral parameters 721 and run-time predicted excitation parameters 723, again according to the HMM-based technique.

The run-time predicted spectral parameters 721 and run-time predicted excitation parameters 723 can be generated in pairs, each pair corresponding to a predicted pair of feature vectors for generating a temporal frame of waveform data.

In accordance with example embodiments, the run-time predicted spectral parameters 721 and run-time predicted excitation parameters 723 may next be input to the speech synthesizer module 708, which may then synthesize a run-time speech signal 725. As an example, the speech synthesizer could include a vocoder that can translate the acoustic features of the input into an output waveform suitable for playout on an audio output device, and/or for analysis by a signal measuring device or element. Such a device or element could be based on signal measuring hardware and/or machine language instructions that implement an analysis algorithm. With sufficient prior training, the run-time speech signal 725 may have a high likelihood of being an accurate speech rendering of the run-time text string 717.

b. Non-Parametric HMM-Based Voice Conversion

Returning for the moment to the TTS synthesis system 600 of FIG. 6, the source HMM in the TTS subsystem 604 may be considered to be a configuration of HMM state models for modeling speech in a voice of a source speaker. In particular, the multivariate Gaussian PDFs for modeling features of each state may be determined by way of training using recordings (or other speech signals) of a source speaker.

With the arrangement shown in FIG. 6, the TTS synthesis system 600 may convert an input text string 601 at run-time into an output utterance 605 spoken in a voice modeled by the source HMM. In general, the source voice may be different than that of a user who creates or provides the input text string at run-time. While a user's interaction with, or use of, a TTS system such as the one illustrated in FIGS. 6 and/or 7 may not necessarily involve speaking (at least when creating or providing text-based input), it may be assumed that such a user has a speaking voice. Accordingly, the term “target speaker” will be used herein to designate a user who provides or generates text-based input to a TTS synthesis system, even if no actual speaking is involved in the process of providing text-based input. It is also possible that the target speaker is a virtual user instantiated by an executable program or application. In accordance with example embodiments, a TTS synthesis system (or, more generally, a speech synthesis system) can employ a technique entailing non-parametric voice conversion for HMM-based speech synthesis in order to synthesize speech that sounds like the voice of a target speaker.

Example operation of non-parametric voice conversion for HMM-based speech synthesis is illustrated conceptually in FIG. 8. As shown, speech parameters may be extracted from target speaker speech signals 801 in a feature extraction process, yielding n = 1, 2, 3, . . . , N target-speaker vectors. The target-speaker vectors may be considered as forming a target-speaker vector set 802-1, 802-2, 802-3, . . . , 802-N, which could be stored in a file or database, for example. In analytic terms, the target-speaker vector set may be said to form a vector space. The target speaker speech signals, represented graphically in the figure as plots of waveforms, could be speech recordings of the target speaker.
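One plausible realization of this feature extraction step is sketched below in Python; extract_spectral_and_excitation() is a hypothetical stand-in for whatever parametric analysis (spectral envelope plus excitation, as described above) a particular deployment uses, and is not an interface defined herein.

    import numpy as np

    def build_target_vector_set(recordings, extract_spectral_and_excitation):
        """recordings: iterable of (waveform, sample_rate) pairs."""
        vectors = []
        for waveform, sample_rate in recordings:
            # One parametric feature vector per temporal frame.
            frames = extract_spectral_and_excitation(waveform, sample_rate)
            vectors.extend(frames)
        return np.asarray(vectors)   # the N target-speaker vectors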

FIG. 8 also depicts a source HMM 804 that includes a configuration of Q HMM state models 804-1, 804-2, 804-3, 804-4, . . . , 804-Q. As indicated, the source HMM 804 may be trained using recordings (or other speech waveforms) of a source speaker. The source HMM 804 could be used, for example, in a TTS synthesis system, such as the one illustrated in FIG. 6. Note that N may not, in general, be equal to Q, although the possibility of equality is not necessarily excluded.

As indicated in the legend at the bottom left of FIG. 8, each HMM state model in FIG. 8 is represented pictorially as a sequence of three states (shown as circles) connected by arrows representing state transitions; each state has a self-transition represented by a circular arrow. A symbolic PDF is shown above each state. The particular forms of the PDFs, as well as the representation of three states per HMM state model, are intended to be illustrative, and not necessarily limiting with respect to embodiments herein.

In accordance with example embodiments, the PDFs of the HMM state models of the source HMM 804 may include multivariate Gaussian PDFs for jointly modeling spectral envelope parameters, and multivariate Gaussian PDFs for jointly modeling excitation parameters. Although this detail of the PDFs is not necessarily shown explicitly in the pictorial representation of the HMM states in FIG. 8, it may be assumed in references to PDFs in the discussions below. It should also be noted, however, that other forms of PDFs may be used in statistical modeling of speech, including in other HMM-based techniques.

In accordance with example embodiments, a matching procedure is carried out to determine a best match to each state model of the source HMM 804 from among the target-speaker vector set 802-1, 802-2, 802-3, . . . , 802-N. The matching procedure is indicated in descriptive text, enumerated as step 1 in FIG. 8, and illustrated conceptually by a respective arrow pointing from just four example HMM states of the source HMM 804 to the target-speaker vector set 802-1, 802-2, 802-3, . . . , 802-N. A similar search may be performed for all of the other HMM state models of the source HMM 804. However, only four are depicted for the sake of brevity in the figure. As described in more detail below, the matching procedure involves a matching under transform technique that compensates for differences between the source speaker and the target speaker.

In further accordance with example embodiments, a mapping HMM, referred to herein as a “converted” HMM 806, is constructed. As indicated by the descriptive text enumerated as step 2 in FIG. 8, the converted HMM is initially just a copy of the source HMM. The converted HMM 806 thus includes a configuration of Q HMM state models 806-1, 806-2, 806-3, 806-4, . . . , 806-Q, as shown. These are initially identical to the Q HMM state models of the source HMM 804.

Following initial construction of the converted HMM 806, the parameters of each of its state models are replaced by a particular target-speaker vector from the target-speaker vector set 802-1, 802-2, 802-3, . . . , 802-N determined to be the closest match to the corresponding state model of the source HMM 804. The replacement operation is indicated in descriptive text, enumerated as step 3 in FIG. 8, and displayed pictorially by curved, dashed arrows from the target-speaker vector set 802-1, 802-2, 802-3, . . . , 802-N to representative HMM states of the converted HMM 806. A replacement may be performed for all of the HMM state models of the converted HMM 806, although for the sake of brevity in the figure, only four curved, dashed arrows representing replacements are shown explicitly.
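Steps 2 and 3 may be summarized by the following Python sketch, in which the converted HMM starts as a copy of the source HMM and each state's generator-model parameters are overwritten with its matched target-speaker vector; the attribute names (states, mean) are hypothetical and serve only to make the example concrete.

    import copy

    def build_converted_hmm(source_hmm, match_index, target_vectors):
        """match_index[q]: index of the target vector matched to state q."""
        converted = copy.deepcopy(source_hmm)         # step 2: copy source
        for q, state in enumerate(converted.states):  # step 3: replacement
            state.mean = target_vectors[match_index[q]]
        return converted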

After each of the HMM states of the initially constructed converted HMM 806 has been replaced by the best match from the target-speaker vector set 802-1, 802-2, 802-3, . . . , 802-N, the converted HMM 806 is then speech-adapted to other characteristics of the target speaker not necessarily represented in the target-speaker vectors. More specifically, an F0 transform may be computed that adapts statistics of F0 of the source HMM 804 to the F0 statistics of the target speaker. The adaptation can be made to match up to first order statistics (means), second order statistics (means and variances), or higher (e.g., matching PDFs). The adaptation can be made directly on values of F0, or on the means of Gaussian states of the HMMs.

The computed F0 transform can be applied to the converted HMM 806 to adapt the F0 statistics of the converted HMM 806 to the F0 statistics of the target speaker. For example, the means of the Gaussian states of the converted HMM 806 may be transformed in this way. This adaptation operation is indicated as step 4 in descriptive text in FIG. 8. The converted HMM 806 can then be used in a TTS synthesis system, in place of the source HMM. Synthesized speech could then sound like, or have characteristics of, the target speaker's voice.
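By way of illustration, a first- and second-order (mean and variance) version of such an F0 transform is sketched below in Python; computing the statistics in the log-F0 domain is an assumption of the example, as is the representation of the converted HMM's F0 stream as a list of Gaussian means.

    import numpy as np

    def make_f0_transform(source_log_f0, target_log_f0):
        """Match mean and variance of source (log-)F0 to the target's."""
        mu_s, sigma_s = np.mean(source_log_f0), np.std(source_log_f0)
        mu_t, sigma_t = np.mean(target_log_f0), np.std(target_log_f0)
        return lambda f0: mu_t + (sigma_t / sigma_s) * (f0 - mu_s)

    def adapt_f0_means(f0_state_means, transform):
        """Apply the transform to the Gaussian F0 means of the converted HMM."""
        return [transform(mean) for mean in f0_state_means]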

In accordance with example embodiments, the operations illustrated by way of example in FIG. 8 could be carried out offline or otherwise prior to run-time application of a TTS synthesis system. In this way, a TTS synthesis system could be prepared for use by multiple run-time users (or target speakers), and for multiple configurations of voice conversion.

FIG. 9 depicts a block diagram of an example TTS synthesis system 900 that implements non-parametric voice conversion, in accordance with an example embodiment. The TTS synthesis system 900 is similar to the TTS synthesis system 600 shown in FIG. 6, except that a converted HMM configured as described above is used for generating speech features for speech synthesis. The functional components of the TTS synthesis system 900 include a text analysis module 902 for converting text into an enriched transcription, and a TTS subsystem 904, including a converted HMM, for generating synthesized speech from the enriched transcription. Selected example inputs, outputs, and intermediate products of example operation are also depicted.

Operation of the TTS synthesis system 900 is largely the same as that described for the TTS synthesis system 600, except that the TTS synthesis system 900 performs voice conversion that causes the output voice of the TTS subsystem 904 to sound like the voice of the target speaker. More specifically, an input text string 901 generated and/or provided by a target speaker may be received by the text analysis module 902, which then generates an enriched transcription 903 as output. As in the example in FIG. 6, the input text string 901 could be a text message, email, chat input, or other text-based communication, for example, and the enriched transcription could correspond to a sequence of labels that identify speech units, including context information.

In accordance with example embodiments, the TTS subsystem 904 may then employ the converted HMM to generate feature vectors corresponding to the enriched transcription 903. Finally, the TTS subsystem 904 can output synthesized speech 905 in a voice of the target speaker.

As described above, and in accordance with example embodiments, the HMM states of the converted HMM, which are initially the same as those of the source HMM, are replaced by target-speaker vectors that are selected for being closest matches to HMM states of the source HMM. The matching operation is carried out under a transform that compensates for differences between the source speaker used to train the source HMM and the target speaker. Further details of the matching under transform technique are described below.

c. Matching Under Transform

In general terms, voice conversion is concerned with converting the voice of a source speaker to the voice of a target speaker. For purposes of the discussion herein, the target speaker is designated X, and the source speaker is designated Y. These designations are intended for convenience of discussion, and other designations could be used. In the context of speech modeling (e.g., recognition and/or synthesis), feature analysis of speech samples of speaker X could generate a vector space of speech features, designated X-space. Similarly, feature analysis of speech samples of speaker Y could generate a vector space of speech features, designated Y-space. For example, feature vectors could correspond to parameterizations of spectral envelopes and/or excitation, as discussed above. In general, X-space and Y-space may be different. For example, they could have a different number of vectors and/or different parameters. Further, they could correspond to different languages, be generated using different feature extraction techniques, and so on.

Matching under transform may be considered a technique for matching the X-space and Y-space vectors under a transform that compensates for differences between speakers X and Y. It may be described in algorithmic terms as a computational method, and can be implemented as machine-readable instructions executable by the one or more processors of a computing system, such as a TTS synthesis system. The machine-language instructions could be stored in one or another form of a tangible, non-transitory computer-readable medium (or other article of manufacture), such as magnetic or optical disk, or the like, and made available to processing elements of the system as part of a manufacturing procedure, configuration procedure, and/or execution start-up procedure, for example.

By way of example, X-space may be taken to include N vectors, designated $\vec{x}_n$, $n = 1, \ldots, N$. Similarly, Y-space may be taken to include Q vectors, designated $\vec{y}_q$, $q = 1, \ldots, Q$. As noted, N and Q may not necessarily be equal, although the possibility that they are is not precluded. In the context of speech modeling, N and Q could correspond to a number of samples from speakers X and Y, respectively.

In accordance with example embodiments, matching under transform (MUT) uses a transformation function $\vec{y} = F(\vec{x})$ to convert X-space vectors to Y-space vectors, and applies a matching-minimization (MM) operation within a deterministic annealing framework to match each Y-space vector with one X-space vector. The transformation function defines a parametric mapping from X-space to Y-space. At the same time, a non-parametric, association mapping from Y-space to X-space may be defined in terms of conditional probabilities. Specifically, for a given X-space vector $\vec{x}_n$ and a given Y-space vector $\vec{y}_q$, an “association probability” $p(\vec{x}_n \mid \vec{y}_q)$ may be used to specify a probability that $\vec{y}_q$ maps to $\vec{x}_n$. In this way, MUT involves bi-directional mapping between X-space and Y-space: parametric in a “forward direction” (X→Y) via $F(\cdot)$, and non-parametric in the “backward direction” (Y→X) via $p(\vec{x}_n \mid \vec{y}_q)$.

A goal of MUT is to determine which X-space vectors $\vec{x}_n$ correspond to a Y-space vector $\vec{y}_q$ in the sense that $F(\vec{x}_n)$ is close to $\vec{y}_q$ in L2-norm, and under the circumstance that $F(\vec{x})$ and the probabilities $p(\vec{x}_n \mid \vec{y}_q)$ are not known ahead of time. Rather than searching for every possible mapping between X-space and Y-space vectors, a distortion metric between $\vec{x}_n$ and $\vec{y}_q$ may be defined as:

$d(\vec{y}_q, \vec{x}_n) = (\vec{y}_q - F(\vec{x}_n))^T\, W_q\, (\vec{y}_q - F(\vec{x}_n))$,  [1]

where $W_q$ is a weighting matrix depending on Y-space vector $\vec{y}_q$. Then taking $p(\vec{y}_q, \vec{x}_n)$ to be the joint probability of matching vectors $\vec{y}_q$ and $\vec{x}_n$, an average distortion over all possible vector combinations may be expressed as:

$D = \sum_{n,q} p(\vec{y}_q, \vec{x}_n)\, d(\vec{y}_q, \vec{x}_n) = \sum_q p(\vec{y}_q) \sum_n p(\vec{x}_n \mid \vec{y}_q)\, d(\vec{y}_q, \vec{x}_n)$.  [2]
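Equations [1] and [2] may be realized directly, as in the following Python sketch; here F is the current transformation function, W[q] the weighting matrix for $\vec{y}_q$, p_y[q] the marginal $p(\vec{y}_q)$, and p_xy[q, n] the association probability $p(\vec{x}_n \mid \vec{y}_q)$, all supplied by the caller.

    import numpy as np

    def distortion(y_q, x_n, F, W_q):                  # equation [1]
        r = y_q - F(x_n)
        return r @ W_q @ r

    def average_distortion(Y, X, F, W, p_y, p_xy):     # equation [2]
        D = 0.0
        for q, y_q in enumerate(Y):
            for n, x_n in enumerate(X):
                D += p_y[q] * p_xy[q, n] * distortion(y_q, x_n, F, W[q])
        return D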

In the MUT approach, the bi-directional mapping provides a balance between forward and backward mapping, ensuring convergence to a meaningful solution.

FIG. 10 is a conceptual illustration of parametric and non-parametric mapping between vector spaces, in accordance with example embodiments. The figure includes an X-space 1002, represented as an oval containing several dots, each dot symbolically representing an X-space vector (e.g., $\vec{x}_n$). Similarly, a Y-space 1004 is represented as an oval containing several dots, each dot symbolically representing a Y-space vector (e.g., $\vec{y}_q$). For purposes of illustration, and by way of example, the two spaces are shown to contain a different number of vectors (dots). An arrow 1003 from X-space to Y-space symbolically represents parametric mapping given by $\vec{y} = F(\vec{x})$. In the opposite direction, an arrow 1005 from Y-space to X-space symbolically represents non-parametric mapping via $p(\vec{x}_n \mid \vec{y}_q)$.

In accordance with example embodiments, minimizing the average distortion D simultaneously for $F(\vec{x})$ and $p(\vec{x}_n \mid \vec{y}_q)$ may be achieved using techniques of deterministic annealing. Specifically, an uncertainty in probabilistic matching between X-space and Y-space may be accounted for by an “association entropy,” which can be expressed as $H(Y, X) = H(Y) + H(X \mid Y)$. Taking

${p\left( {\overset{\rightarrow}{y}}_{q} \right)} = \frac{1}{Q}$

so as to ensure that all Y-space vectors are accounted for equally, it follows that H(Y) is constant. A composite minimization criterion D′ may then be defined as:

$D' = D - \lambda H(X \mid Y)$,  [3]

where the entropy Lagrangian λ corresponds to an annealing temperature.

Minimizing D′ with respect to the association probabilities yields the associations. In the general case of λ≠0, the association probabilities may be expressed in the form of a Gibbs distribution and determined in what is referred to algorithmically herein as an “association step.” When λ approaches zero, the mapping between Y-space and X-space becomes many-to-one (many Y-space vectors may be matched to one X-space vector). It can be shown in this case (λ→0) that the association probabilities may be determined from a search for the nearest X-space vector in terms of the distortion metric $d(\vec{y}_q, \vec{x}_n)$, in what is referred to algorithmically herein as a “matching step.”

Given the associations determined either by an association step or a matching step, the transform function can be defined and its optimal parameters determined by solving a minimization of D′ with respect to the defined form of $F(\cdot)$. This determination of $F(\vec{x})$ is referred to algorithmically herein as a “minimization step.”

The purpose of the transform is to compensate for speaker differences between, in this example, speakers X and Y. More specifically, cross-speaker variability can be captured by a linear transform of the form $\vec{\mu}_k + \Sigma_k \vec{x}_n$, where $\vec{\mu}_k$ is a bias vector, and $\Sigma_k$ is the linear transformation matrix of the k-th class. The linear transform matrix can compensate for differences in the vocal tract that are related to vocal tract shape and size. Accordingly, $F(\vec{x})$ may be defined as a mixture-of-linear-regressions function:

$F(\vec{x}_n) = \sum_{k=1}^{K} p(k \mid \vec{x}_n)\,[\vec{\mu}_k + \Sigma_k \vec{x}_n]$,  [4]

where $p(k \mid \vec{x}_n)$ is the probability that $\vec{x}_n$ belongs to the k-th class.

Assuming a class of probabilities $p(k \mid \vec{x}_n)$ corresponding to a Gaussian mixture model (GMM), and reformulating $\Sigma_k \vec{x}_n$ using the vector operator $\mathrm{vec}\{\cdot\}$ and the Kronecker product to define $\vec{\sigma}_k \equiv \mathrm{vec}\{\Sigma_k\}$, it can be shown that $F(\vec{x})$ may be expressed as:

$F(\vec{x}_n) = \begin{bmatrix} \Delta_n & B_n \end{bmatrix} \begin{bmatrix} \vec{\mu} \\ \vec{\sigma} \end{bmatrix} = \Gamma_n \vec{\gamma}$, where  [5]

$\Delta_n = \begin{bmatrix} p(k{=}1 \mid \vec{x}_n)\, I & p(k{=}2 \mid \vec{x}_n)\, I & \ldots & p(k{=}K \mid \vec{x}_n)\, I \end{bmatrix}$,  [6]

$\vec{\mu} = \begin{bmatrix} \vec{\mu}_1^T & \vec{\mu}_2^T & \ldots & \vec{\mu}_K^T \end{bmatrix}^T$,  [7]

$B_n = \begin{bmatrix} p(k{=}1 \mid \vec{x}_n)\, X_n & p(k{=}2 \mid \vec{x}_n)\, X_n & \ldots & p(k{=}K \mid \vec{x}_n)\, X_n \end{bmatrix}$,  [8]

$\vec{\sigma} = \begin{bmatrix} \vec{\sigma}_1^{\prime\,T} & \vec{\sigma}_2^{\prime\,T} & \ldots & \vec{\sigma}_K^{\prime\,T} \end{bmatrix}^T$.  [9]

In the above expressions, I is the identity matrix (appropriately dimensioned), $\vec{\sigma}_k' \equiv \mathrm{vec}\{\Sigma_k'\}$ contains only the free parameters of the structured matrix $\Sigma_k$, and $\Sigma_k \vec{x}_n = X_n \vec{\sigma}_k'$. The optimal $\vec{\gamma}$ can then be obtained by partial differentiation, setting

$\frac{\partial D'}{\partial \vec{\gamma}} = 0.$

Doing so yields the following unique solution:

$\vec{\gamma} = \Big( \sum_q p(\vec{y}_q) \sum_n p(\vec{x}_n \mid \vec{y}_q)\, \Gamma_n^T W_q \Gamma_n \Big)^{-1} \Big( \sum_q p(\vec{y}_q) \sum_n p(\vec{x}_n \mid \vec{y}_q)\, \Gamma_n^T W_q \vec{y}_q \Big)$.  [10]
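Equation [10] amounts to accumulating weighted normal equations over all (q, n) pairs and solving a single linear system, as in the Python sketch below; Gamma(x_n) is assumed to return the matrix $\Gamma_n$ of equations [5]-[9], and with $p(\vec{y}_q) = 1/Q$ the marginal cancels as a common factor of both sides.

    import numpy as np

    def solve_gamma(Y, X, Gamma, W, p_xy):
        """Solve the normal equations of equation [10] for the stacked
        parameter vector gamma (the mu and sigma blocks)."""
        dim = Gamma(X[0]).shape[1]
        A = np.zeros((dim, dim))
        b = np.zeros(dim)
        for q, y_q in enumerate(Y):
            for n, x_n in enumerate(X):
                G = Gamma(x_n)
                A += p_xy[q, n] * (G.T @ W[q] @ G)
                b += p_xy[q, n] * (G.T @ W[q] @ y_q)
        return np.linalg.solve(A, b)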

Based on the discussion above, two algorithms may be used to obtain matching under transform. The first is referred to herein as “association-minimization,” and the second is referred to herein as “matching-minimization.” In accordance with example embodiments, association-minimization may be implemented with the following steps:

1. Initialization.

2. Set λ to a high value (e.g., λ=1).

3. Association step.

4. Minimization step.

5. Repeat from step 3 until convergence.

6. Lower λ according to a cooling schedule and repeat from step 3, until λ approaches zero or another target value.

Initialization sets a starting point for MUT optimization, and may differ depending on the speech features used. For conversion of mel-cepstral coefficient (MCEP) parameters, a search for a good vocal-tract length normalization transform with a single linear frequency warping factor may suffice. Empirical evidence suggests that an adequate initialization transform is one that minimizes the distortion in an interval [0.7, 1.3] of frequency warping factor. The association step uses the Gibbs distribution function for the association probabilities, as described above. The minimization step then incorporates the transformation function. Steps 5 and 6 iterate for convergence and cooling.
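The association step admits a compact realization, sketched below in Python under the assumption that the distortion() helper of equations [1]-[2] above is available; each row of the result is a Gibbs distribution over X-space, with λ playing the role of the annealing temperature.

    import numpy as np

    def association_step(Y, X, F, W, lam):
        """p_xy[q, n]: Gibbs association probability p(x_n | y_q)."""
        p_xy = np.zeros((len(Y), len(X)))
        for q, y_q in enumerate(Y):
            d = np.array([distortion(y_q, x_n, F, W[q]) for x_n in X])
            g = np.exp(-(d - d.min()) / lam)   # shift for numerical stability
            p_xy[q] = g / g.sum()
        return p_xy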

In further accordance with example embodiments, matching-minimization may be implemented with the following steps:

1. Initialization.

2. Matching step.

3. Minimization step.

4. Repeat from step 2 until convergence.

Initialization is the same as that for association-minimization, starting with a transform that minimizes the distortion in an interval [0.7, 1.3] of frequency warping factor. The matching step uses association probabilities determined from a search for the nearest X-space vector, as described above. The minimization step then incorporates the transformation function. Step 4 iterates for convergence. Note that there is no cooling step, since matching-minimization assumes λ=0.
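A minimal Python sketch of the matching-minimization loop follows; it reuses the distortion() helper from above, and fit_transform() is a hypothetical stand-in for the minimization step (e.g., the closed-form solve of equation [10]). Running to a fixed iteration count rather than a convergence test is a simplification of the example.

    import numpy as np

    def matching_minimization(Y, X, F_init, W, fit_transform, iters=20):
        F = F_init
        for _ in range(iters):
            # Matching step (lambda = 0): nearest transformed x_n per y_q.
            match = [
                int(np.argmin([distortion(y_q, x_n, F, W[q]) for x_n in X]))
                for q, y_q in enumerate(Y)
            ]
            # Minimization step: refit the transform to the current matches.
            F = fit_transform(Y, X, match)
        return F, match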

In certain practical circumstances, matching-minimization may yield comparable results to association-minimization, but at lower computation cost. Accordingly, only matching-minimization is considered below for MUT. It will be appreciated that the techniques discussed below could be generalized for application to association-minimization.

As described above in the context of HMM-based voice conversion, matching under transform may be used for determining the closest matching target-speaker vector to each HMM state of the source HMM. In accordance with example embodiments, this matching may be accomplished by implementing the matching-minimization algorithm described above. For example, as discussed in connection with FIG. 8, for the source HMM discussed above, each Gaussian state jointly models either the spectral envelope parameters and their delta and delta-delta values, or the excitation parameters and their delta and delta-delta values. Similarly, the target-speaker vectors include spectral envelope parameters and excitation parameters of the target speaker, as extracted from speech signals of the target speaker. The matching-minimization algorithm may be applied separately for both spectral envelope parameters and for excitation parameters in order to match Gaussian states of the source HMM and the target-speaker vectors. Since the procedure is largely the same for both spectral envelope parameters and for excitation parameters, no explicit distinction is made between parameter types in relation to the Gaussian states referenced in the discussion below.

For each Gaussian state of the source HMM, a corresponding target-speaker vector that, when transformed, is nearest to the mean of the Gaussian state in terms of a distance criterion based on mean-squared-error (mse) may be determined as follows. Matching is first initialized by scanning a range of linear warping factors from 0.7 to 1.3 for the factor that minimizes the overall distortion. This may be accomplished by resampling the relevant parameter (spectral envelope or excitation) at the linearly warped frequency scale. The matching-minimization algorithm may then be run with a single class transform to obtain the matching between the transformed vectors and the means of the HMM-state Gaussians of the source HMM. The single class transform serves to compensate for speaker differences. Note that the matching accounts for the means, the deltas, and the delta-deltas as well.
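The initialization scan may be illustrated as follows in Python; warp() is a hypothetical resampling of a parameter vector at a linearly warped frequency scale, and the grid of 25 candidate factors is an arbitrary choice of the example.

    import numpy as np

    def best_warping_factor(state_means, target_vectors, warp,
                            factors=np.linspace(0.7, 1.3, 25)):
        """Return the warping factor minimizing overall distortion."""
        def total_distortion(alpha):
            warped = np.asarray([warp(v, alpha) for v in target_vectors])
            # For each state mean, distance to its nearest warped vector.
            return sum(
                np.min(np.sum((warped - mu) ** 2, axis=1))
                for mu in state_means
            )
        return min(factors, key=total_distortion)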

The accuracy of the results of the matching-minimization procedure helps ensure high-quality voice conversion using the converted HMM with the HMM states replaced by those determined from the matching. The matching-minimization procedure may also be implemented efficiently and cost-effectively, thereby contributing to overall scalability of a TTS synthesis system that incorporates voice conversion and may be used by many (e.g., millions of) users.

CONCLUSION

An illustrative embodiment has been described by way of example herein. Those skilled in the art will understand, however, that changes and modifications may be made to this embodiment without departing from the true scope and spirit of the elements, products, and methods to which the embodiment is directed, which is defined by the claims.

What is claimed is:
 1. A method comprising: training a source hidden Markov model (HMM) based speech features generator implemented by one or more processors of a system using speech signals of a source speaker, wherein the source HMM based speech features generator comprises a configuration of source HMM state models, each of the source HMM state models having a set of generator-model functions; extracting speech features from speech signals of a target speaker to generate a target set of target-speaker vectors; for each given source HMM state model of the configuration, determining a particular target-speaker vector from among the target set that most closely matches parameters of the set of generator-model functions of the given source HMM; determining a fundamental frequency (F0) transform that speech-adapts F0 statistics of the source HMM based speech features generator to match F0 statistics of the speech of the target speaker; constructing a converted HMM based speech features generator implemented by one or more processors of the system to be the same as the source HMM based speech features generator, but wherein the parameters of the set of generator-model functions of each source HMM state model of the converted HMM based speech features generator are replaced with the determined particular most closely matching target-speaker vector from among the target set; and speech-adapting F0 statistics of the converted HMM based speech features generator using the F0 transform to thereby produce a speech-adapted converted HMM based speech features generator.
 2. The method of claim 1, wherein the source HMM based speech features generator and the converted HMM based speech features generator are implemented by at least one common processor from among the one or more processors of the system.
 3. The method of claim 1, further comprising: creating an enriched transcription of a run-time text string; using the speech-adapted converted HMM based speech features generator to convert the enriched transcription into corresponding output speech features; and generating a synthesized utterance of the enriched transcription using the output speech features, the synthesized utterance having voice characteristics of the target speaker.
 4. The method of claim 3, wherein the converted HMM based speech features generator is part of a text-to-speech (TTS) system, wherein creating the enriched transcription of the run-time text string comprises receiving the run-time text string at the TTS system, and converting the received run-time text string into the enriched transcription of the run-time text string by the TTS system, and wherein generating the synthesized utterance of the enriched transcription using the output speech features comprises synthesizing speech by the TTS system.
 5. The method of claim 1, wherein the set of generator-model functions for each given source HMM state model comprises a multivariate spectral probability density function (PDF) for jointly modeling spectral envelope parameters of a phonetic unit modeled by a given source HMM state model, and a multivariate excitation PDF for jointly modeling excitation parameters of the phonetic unit, and wherein determining for each given source HMM state model the particular target-speaker vector from among the target set that most closely matches parameters of the set of generator-model functions of the given source HMM comprises: determining a target-speaker vector from among the target set that is computationally nearest to parameters of the multivariate spectral PDF of the given source HMM state model in terms of a distance criterion based on one of mean-squared-error (mse) or Kullback-Leibler distance; and determining a target-speaker vector from among the target set that is computationally nearest to the multivariate excitation PDF of the given source HMM state model in terms of a distance criterion based on one of mse or Kullback-Leibler distance.
 6. The method of claim 5, wherein the multivariate spectral PDF of at least one source HMM state model has the mathematical form of a multivariate Gaussian function.
 7. The method of claim 5, wherein the phonetic unit is one of a phoneme or a triphone.
 8. The method of claim 5, wherein the spectral envelope parameters of the phonetic unit are Mel Cepstral coefficients, Line Spectral Pairs, Linear Predictive coefficients, or Mel-Generalized Cepstral Coefficients, and further include indicia of first and second time derivatives of the spectral envelope parameters of the phonetic unit, and wherein extracting speech features from speech signals of the target speaker comprises decomposing the speech signals of the target speaker into at least one of spectral envelopes, aperiodicity envelopes, fundamental frequencies, or voicing.
 9. The method of claim 1, wherein determining for each given source HMM state model the particular target-speaker vector from among the target set that most closely matches parameters of the set of generator-model functions of the given source HMM comprises: making a determination of an optimal correspondence between a multivariate probability density function (PDF) of the given source HMM and a particular target-speaker vector from among the target set, the determination being made under a transform that compensates for differences between speech of the source speaker and speech of the target speaker.
 10. The method of claim 1, wherein constructing the converted HMM based speech features generator comprises transforming the source HMM based speech features generator into the converted HMM based speech features generator by replacing the parameters of the set of generator-model functions of each source HMM state model of the source HMM based speech features generator with the determined particular most closely matching target-speaker vector from among the target set, and wherein speech-adapting the F0 statistics of the converted HMM based speech features generator using the F0 transform comprises speech-adapting the F0 statistics of the transformed source HMM based speech features generator using the F0 transform.
 11. A method comprising: implementing a source hidden Markov model (HMM) based speech features generator by one or more processors of a system, wherein the source HMM based speech features generator comprises a configuration of source HMM state models, each of the source HMM state models having a set of generator-model functions, and wherein the implemented source HMM based speech features generator is trained using speech signals of a source speaker; providing a set of target-speaker vectors, the set of target-speaker vectors having been generated from speech features extracted from speech signals of a target speaker; implementing a converted HMM based speech features generator that is the same as the source HMM based speech features generator, but wherein (i) parameters of the set of generator-model functions of each given source HMM state model of the converted HMM based speech features generator are replaced with a particular target-speaker vector from among the target set that most closely matches the parameters of the set of generator-model functions of the given source HMM, and (ii) fundamental frequency (F0) statistics of the converted HMM based speech features generator are speech-adapted using an F0 transform that speech-adapts F0 statistics of the source HMM based speech features generator to match F0 statistics of the speech of the target speaker; receiving an enriched transcription of a run-time text string by an input device of the system; using the converted HMM based speech features generator to convert the enriched transcription into corresponding output speech features; and generating a synthesized utterance of the enriched transcription using the output speech features, the synthesized utterance having voice characteristics of the target speaker.
 12. A system comprising: one or more processors; memory; and machine-readable instructions stored in the memory, that upon execution by the one or more processors cause the system to carry out functions including: implementing a source hidden Markov model (HMM) based speech features generator by one or more processors of a system, wherein the source HMM based speech features generator comprises a configuration of source HMM state models, each of the source HMM state models having a set of generator-model functions, and wherein the implemented source HMM based speech features generator is trained using speech signals of a source speaker; providing a set of target-speaker vectors, the set of target-speaker vectors having been generated from speech features extracted from speech signals of a target speaker; implementing a converted HMM based speech features generator that is the same as the source HMM based speech features generator, but wherein (i) parameters of the set of generator-model functions of each given source HMM state model of the converted HMM based speech features generator are replaced with a particular target-speaker vector from among the target set that most closely matches the parameters of the set of generator-model functions of the given source HMM, and (ii) fundamental frequency (F0) statistics of the converted HMM based speech features generator are speech-adapted using an F0 transform that speech-adapts F0 statistics of the source HMM based speech features generator to match F0 statistics of the speech of the target speaker.
 13. The system of claim 12, wherein the functions further include: creating an enriched transcription of a run-time text string; using the speech-adapted converted HMM based speech features generator to convert the enriched transcription into corresponding output speech features; and generating a synthesized utterance of the enriched transcription using the output speech features, the synthesized utterance having voice characteristics of the target speaker.
 14. The system of claim 13, wherein the system is part of a text-to-speech (TTS) system, wherein creating the enriched transcription of the run-time text string comprises receiving the run-time text string at the TTS system, and converting the received run-time text string into the enriched transcription of the run-time text string by the TTS system, and wherein generating the synthesized utterance of the enriched transcription using the output speech features comprises synthesizing speech by the TTS system.
 15. The system of claim 12, wherein the set of generator-model functions for each given source HMM state model comprises a multivariate spectral probability density function (PDF) for jointly modeling spectral envelope parameters of a phonetic unit modeled by a given source HMM state model, and a multivariate excitation PDF for jointly modeling excitation parameters of the phonetic unit, and wherein determining for each given source HMM state model the particular target-speaker vector from among the target set that most closely matches parameters of the set of generator-model functions of the given source HMM comprises:
determining a target-speaker vector from among the target set that is computationally nearest to parameters of the multivariate spectral PDF of the given source HMM state model in terms of a distance criterion based on one of mean-squared error (MSE) or Kullback-Leibler distance;
and determining a target-speaker vector from among the target set that is computationally nearest to the multivariate excitation PDF of the given source HMM state model in terms of a distance criterion based on one of MSE or Kullback-Leibler distance.
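Claim 15 leaves the distance criterion as either mean-squared error or Kullback-Leibler distance. A minimal Python sketch of both criteria follows; the KL branch assumes the target vector is treated as the mean of a Gaussian sharing the state's covariance, under which the KL divergence reduces to a Mahalanobis-type distance (an interpretive assumption, since the claim does not fix how a vector is compared to a PDF).

    import numpy as np

    def nearest_target(mean, cov, targets, criterion="mse"):
        """Index of the target-speaker vector computationally nearest to a
        state's multivariate Gaussian parameters (mean, cov)."""
        diff = np.asarray(targets) - mean
        if criterion == "mse":
            d = np.einsum("ij,ij->i", diff, diff)  # squared Euclidean distances
        else:
            # Mahalanobis form, i.e. KL between equal-covariance Gaussians up to scale
            d = np.einsum("ij,jk,ik->i", diff, np.linalg.inv(cov), diff)
        return int(np.argmin(d))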
 16. The system of claim 15, wherein the multivariate spectral PDF of at least one source HMM state model has the mathematical form of a multivariate Gaussian function.
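For reference, the multivariate Gaussian form recited in claim 16 is, for a d-dimensional parameter vector x with mean vector mu and covariance matrix Sigma:

$$
\mathcal{N}(\mathbf{x};\boldsymbol{\mu},\boldsymbol{\Sigma})
= \frac{1}{(2\pi)^{d/2}\,|\boldsymbol{\Sigma}|^{1/2}}
\exp\!\Big(-\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{\top}\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\Big).
$$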
 17. The system of claim 15, wherein the phonetic unit is one of a phoneme or a triphone.
 18. The system of claim 15, wherein the spectral envelope parameters of the phonetic unit are Mel Cepstral Coefficients, Line Spectral Pairs, Linear Predictive Coefficients, or Mel-Generalized Cepstral Coefficients, and further include indicia of first and second time derivatives of the spectral envelope parameters of the phonetic unit, and wherein extracting speech features from speech signals of the target speaker comprises decomposing the speech signals of the target speaker into at least one of spectral envelopes, aperiodicity envelopes, fundamental frequencies, or voicing.
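The decomposition recited in claim 18, together with the first and second time-derivative indicia, can be sketched as follows. The use of the pyworld bindings to the WORLD vocoder is an illustrative assumption; the claims name no particular analysis tool.

    import numpy as np
    import pyworld as pw  # WORLD vocoder bindings (illustrative choice)

    def extract_target_features(x, fs):
        """Decompose a target-speaker waveform into spectral envelope,
        aperiodicity, F0, and voicing, and append delta / delta-delta
        indicia to the spectral envelope parameters."""
        x = np.ascontiguousarray(x, dtype=np.float64)
        f0, t = pw.dio(x, fs)                # coarse F0 track
        f0 = pw.stonemask(x, f0, t, fs)      # refined F0
        sp = pw.cheaptrick(x, f0, t, fs)     # spectral envelope per frame
        ap = pw.d4c(x, f0, t, fs)            # aperiodicity envelope per frame
        voicing = f0 > 0                     # voiced/unvoiced flags
        delta = np.gradient(sp, axis=0)      # first time derivative
        delta2 = np.gradient(delta, axis=0)  # second time derivative
        return np.hstack([sp, delta, delta2]), ap, f0, voicing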
 19. The system of claim 12, wherein the particular target-speaker vector from among the target set that most closely matches parameters of the set of generator-model functions of the given source HMM comprises a particular target-speaker vector from among the target set that optimally corresponds to parameters of a multivariate PDF of the given source HMM state model, the optimal correspondence being determined under a transform that compensates for differences between speech of the source speaker and speech of the target speaker.
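Claim 19 places the matching under a transform that compensates for speaker differences. One simple choice of compensating transform, assumed here purely for illustration, is per-dimension standardization of both spaces before the nearest-neighbor search:

    import numpy as np

    def match_under_transform(state_means, targets):
        """Nearest-neighbor matching after z-scoring each space, so that
        gross per-speaker offsets and scales do not dominate the distances."""
        def zscore(a):
            a = np.asarray(a, dtype=float)
            return (a - a.mean(axis=0)) / (a.std(axis=0) + 1e-8)
        s, t = zscore(state_means), zscore(targets)
        d = ((s[:, None, :] - t[None, :, :]) ** 2).sum(axis=2)  # pairwise squared distances
        return d.argmin(axis=1)  # best target index for each state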
 20. The system of claim 12, wherein implementing the converted HMM based speech features generator comprises:
transforming the implemented source HMM based speech features generator into the converted HMM based speech features generator by replacing the parameters of the set of generator-model functions of each source HMM state model of the source HMM based speech features generator with the determined particular most closely matching target-speaker vector from among the target set;
and speech-adapting the F0 statistics of the transformed source HMM based speech features generator using the F0 transform.
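A common concrete form for the F0 transform of claim 20 (assumed here; the claim requires only that F0 statistics be matched) shifts and scales log F0 so that source statistics map onto target statistics:

$$
\log F_0' = \mu_t + \frac{\sigma_t}{\sigma_s}\big(\log F_0 - \mu_s\big),
$$

where $(\mu_s, \sigma_s)$ and $(\mu_t, \sigma_t)$ are the mean and standard deviation of log F0 for the source and target speakers, respectively.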
 21. An article of manufacture including a computer-readable storage medium having stored thereon program instructions that, upon execution by one or more processors of a system, cause the system to perform operations comprising:
implementing a source hidden Markov model (HMM) based speech features generator by the one or more processors of the system, wherein the source HMM based speech features generator comprises a configuration of source HMM state models, each of the source HMM state models having a set of generator-model functions, and wherein the implemented source HMM based speech features generator is trained using speech signals of a source speaker;
providing a set of target-speaker vectors, the set of target-speaker vectors having been generated from speech features extracted from speech signals of a target speaker;
implementing a converted HMM based speech features generator that is the same as the source HMM based speech features generator, but wherein (i) parameters of the set of generator-model functions of each given source HMM state model of the converted HMM based speech features generator are replaced with a particular target-speaker vector from among the target set that most closely matches the parameters of the set of generator-model functions of the given source HMM, and (ii) fundamental frequency (F0) statistics of the converted HMM based speech features generator are speech-adapted using an F0 transform that speech-adapts F0 statistics of the source HMM based speech features generator to match F0 statistics of the speech of the target speaker.
 22. The article of manufacture of claim 21, wherein the operations further include:
creating an enriched transcription of a run-time text string;
using the speech-adapted converted HMM based speech features generator to convert the enriched transcription into corresponding output speech features;
and generating a synthesized utterance of the enriched transcription using the output speech features, the synthesized utterance having voice characteristics of the target speaker.
 23. The article of manufacture of claim 22, wherein the converted HMM based speech features generator is part of a text-to-speech (TTS) system, wherein creating the enriched transcription of the run-time text string comprises receiving the run-time text string at the TTS system, and converting the received run-time text string into the enriched transcription of the run-time text string by the TTS system, and wherein generating the synthesized utterance of the enriched transcription using the output speech features comprises synthesizing speech by the TTS system.
 24. The article of manufacture of claim 21, wherein the set of generator-model functions for each given source HMM state model comprises a multivariate spectral probability density function (PDF) for jointly modeling spectral envelope parameters of a phonetic unit modeled by a given source HMM state model, and a multivariate excitation PDF for jointly modeling excitation parameters of the phonetic unit, and wherein determining for each given source HMM state model the particular target-speaker vector from among the target set that most closely matches parameters of the set of generator-model functions of the given source HMM comprises:
determining a target-speaker vector from among the target set that is computationally nearest to parameters of the multivariate spectral PDF of the given source HMM state model in terms of a distance criterion based on one of mean-squared error (MSE) or Kullback-Leibler distance;
and determining a target-speaker vector from among the target set that is computationally nearest to the multivariate excitation PDF of the given source HMM state model in terms of a distance criterion based on one of MSE or Kullback-Leibler distance.
 25. The article of manufacture of claim 24, wherein the multivariate spectral PDF of at least one source HMM state model has the mathematical form of a multivariate Gaussian function.
 26. The article of manufacture of claim 24, wherein the phonetic unit is one of a phoneme or a triphone.
 27. The article of manufacture of claim 24, wherein the spectral envelope parameters of the phonetic unit are Mel Cepstral Coefficients, Line Spectral Pairs, Linear Predictive Coefficients, or Mel-Generalized Cepstral Coefficients, and further include indicia of first and second time derivatives of the spectral envelope parameters of the phonetic unit, and wherein extracting speech features from speech signals of the target speaker comprises decomposing the speech signals of the target speaker into at least one of spectral envelopes, aperiodicity envelopes, fundamental frequencies, or voicing.
 28. The article of manufacture of claim 21, wherein the particular target-speaker vector from among the target set that most closely matches parameters of the set of generator-model functions of the given source HMM comprises a particular target-speaker vector from among the target set that optimally corresponds to parameters of a multivariate PDF of the given source HMM state model, the optimal correspondence being determined under a transform that compensates for differences between speech of the source speaker and speech of the target speaker.
 29. The article of manufacture of claim 21, wherein implementing the converted HMM based speech features generator comprises:
transforming the implemented source HMM based speech features generator into the converted HMM based speech features generator by replacing the parameters of the set of generator-model functions of each source HMM state model of the source HMM based speech features generator with the determined particular most closely matching target-speaker vector from among the target set;
and speech-adapting the F0 statistics of the transformed source HMM based speech features generator using the F0 transform.
 30. An article of manufacture including a computer-readable storage medium, having stored thereon program instructions that, upon execution by one or more processors of a system, cause the system to perform operations comprising:
training a source hidden Markov model (HMM) based speech features generator using speech signals of a source speaker, wherein the source HMM based speech features generator comprises a configuration of source HMM state models, each of the source HMM state models having a set of generator-model functions;
extracting speech features from speech signals of a target speaker to generate a target set of target-speaker vectors;
for each given source HMM state model of the configuration, determining a particular target-speaker vector from among the target set that most closely matches parameters of the set of generator-model functions of the given source HMM;
determining a fundamental frequency (F0) transform that speech-adapts F0 statistics of the source HMM based speech features generator to match F0 statistics of the speech of the target speaker;
constructing a converted HMM based speech features generator to be the same as the source HMM based speech features generator, but wherein the parameters of the set of generator-model functions of each source HMM state model of the converted HMM based speech features generator are replaced with the determined particular most closely matching target-speaker vector from among the target set;
and speech-adapting F0 statistics of the converted HMM based speech features generator using the F0 transform to thereby produce a speech-adapted converted HMM based speech features generator.
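Claim 30 additionally recites training the source generator. As a toy illustration of that step, the hmmlearn package (an assumed tool; the claims name none) can fit a Gaussian HMM whose per-state means and covariances play the role of the generator-model function parameters:

    import numpy as np
    from hmmlearn import hmm

    def train_source_generator(features, n_states=5):
        """Fit a Gaussian HMM to source-speaker feature frames; state i is
        then parameterized by model.means_[i] and model.covars_[i]."""
        model = hmm.GaussianHMM(n_components=n_states,
                                covariance_type="diag", n_iter=20)
        model.fit(np.asarray(features))  # features: [frames x dims] array
        return model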